Computer Vision Interview Questions — CNN, YOLO, ViT, Segmentation | AmanAI Lab

senior

What is the SAM (Segment Anything Model) and what makes it special?

Model Answer

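SAM (Meta AI, 2023) is a promptable image segmentation model trained on SA-1B, a dataset of about 1.1B masks over 11M images. Key innovation: zero-shot generalization. Given a point or box prompt (optionally a coarse mask), it segments the indicated object without task-specific training; text prompts were explored in the paper only as a proof of concept and are not supported by the released model. Architecture: Image Encoder (ViT-H by default, run once per image), Prompt Encoder (points, boxes, masks), and a lightweight Mask Decoder (outputs candidate masks, each with a predicted IoU confidence score). Because a single point can be ambiguous (part vs. whole object), the decoder can return several candidate masks for one prompt. Applications: data annotation automation, medical imaging, robotics. SAM 2 (2024) extends this to video, using memory attention over past frames to track objects across frames.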
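
The encode-once, prompt-many workflow can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `ToyPromptableSegmenter` is a hypothetical class, the "image" is a small grid of region labels, the "embedding" is the grid itself, and a flood fill stands in for the mask decoder. It is not how SAM computes masks. The method names `set_image` and `predict` only mirror the shape of the official `SamPredictor` interface.

```python
from collections import deque


class ToyPromptableSegmenter:
    """Toy stand-in for SAM's encode-once, prompt-many design (not the real model)."""

    def __init__(self):
        self.encoder_calls = 0
        self._embedding = None

    def set_image(self, image):
        # Expensive step in real SAM (image encoder forward pass); run once per image.
        # Assumption: the toy "embedding" is just the label grid itself.
        self.encoder_calls += 1
        self._embedding = image

    def predict(self, point):
        # Cheap step, run once per prompt.
        # Assumption: a 4-connected flood fill replaces the learned mask decoder.
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        grid = self._embedding
        rows, cols = len(grid), len(grid[0])
        row, col = point
        target = grid[row][col]
        mask = [[False] * cols for _ in range(rows)]
        mask[row][col] = True
        queue = deque([(row, col)])
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and not mask[nr][nc] and grid[nr][nc] == target:
                    mask[nr][nc] = True
                    queue.append((nr, nc))
        return mask


image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
]
segmenter = ToyPromptableSegmenter()
segmenter.set_image(image)            # encode the image once
mask_a = segmenter.predict((0, 0))    # point prompt on the top-left region
mask_b = segmenter.predict((2, 3))    # point prompt on the bottom row
print(sum(map(sum, mask_a)), sum(map(sum, mask_b)), segmenter.encoder_calls)
```

The point of the split is cost: the heavy encoder runs once, and each additional prompt only pays for the light decoding step, which is what makes interactive annotation practical.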

mid

What is the Vision Transformer (ViT) and how does it apply attention to images?

Model Answer

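ViT splits an image into fixed-size patches (e.g., 16×16 pixels), flattens each patch, linearly projects it to an embedding, and adds a learned position embedding so that spatial order is preserved. The resulting sequence of tokens is processed by a standard transformer encoder with self-attention. A special [CLS] token is prepended, and its final representation is used for classification. When pre-trained on large datasets (ImageNet-21K, JFT-300M), ViT matches or exceeds CNN performance; on smaller datasets it tends to underperform CNNs because it lacks their built-in locality and translation-equivariance biases. Advantages: captures long-range dependencies naturally (every patch can attend to every other patch from the first layer), scales well with data and compute. CLIP (OpenAI) offers ViT image encoders alongside ResNet variants. DINOv2 (Meta) is a strong self-supervised ViT.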
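
A minimal sketch of the patch-to-token step and one attention pass, in plain Python. The helper names (`patchify`, `vit_sequence_length`, `self_attention`) are hypothetical, and the attention uses Q = K = V = the raw patch vectors. A real ViT applies learned projections, position embeddings, multiple heads, and MLP blocks, all omitted here.

```python
import math


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)  # append all channels of this pixel
            patches.append(flat)
    return patches


def vit_sequence_length(height, width, patch, use_cls=True):
    """Number of tokens the transformer sees for one image."""
    return (height // patch) * (width // patch) + (1 if use_cls else 0)


def self_attention(tokens):
    """Single-head scaled dot-product attention with Q = K = V = tokens.

    Assumption: no learned projections, so this only shows the mixing pattern.
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Each output token is a weighted average over ALL tokens (global mixing).
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(dim)])
    return outputs


# Toy 4x4 "RGB" image where every pixel stores (row, col, 0).
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)         # 4 patches, each 2*2*3 = 12 values
attended = self_attention(patches)         # same shape as the input tokens
tokens_224 = vit_sequence_length(224, 224, 16)  # 14*14 patches + [CLS]
print(len(patches), len(patches[0]), tokens_224)
```

For a 224×224 image with 16×16 patches this gives 196 patch tokens plus [CLS], so 197 tokens. Self-attention cost grows quadratically with that count, which is why patch size is a key speed/accuracy knob.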
