CNNs, object detection, segmentation, vision transformers and image processing fundamentals.
Key Concepts to Know
Interview Questions
What is the SAM (Segment Anything Model) and what makes it special?
Model Answer
SAM (Meta, 2023) is a promptable image segmentation model trained on the SA-1B dataset (about 1.1B masks over 11M images). Key innovation: zero-shot generalization. Given a point, box, or coarse mask prompt, it segments the indicated object without task-specific training; free-text prompts were only explored as a proof of concept in the paper and are not part of the released model. Architecture: Image Encoder (ViT-H), Prompt Encoder (points, boxes, masks), Mask Decoder (outputs masks plus a confidence score for each). Applications: data annotation automation, medical imaging, robotics. SAM 2 (2024) extends this to video, using memory attention to track objects across frames.
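The prompt-in, masks-plus-confidence-out interface described above can be illustrated with a toy sketch. This is not SAM and uses no SAM code: the flood-fill "decoder", the `boundary_contrast` score, and the `segment` function are all hypothetical stand-ins invented for this example, chosen only to show how a point or box prompt yields several candidate masks that are then ranked by a confidence value.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connected offsets (dx, dy)

def flood_fill(image, seed, tol):
    """Toy stand-in for a mask decoder: grow a region from the seed (x, y)
    over 4-connected pixels whose intensity is within tol of the seed's."""
    h, w = len(image), len(image[0])
    sx, sy = seed
    ref = image[sy][sx]
    mask = [[0] * w for _ in range(h)]
    mask[sy][sx] = 1
    queue = deque([(sx, sy)])
    while queue:
        x, y = queue.popleft()
        for dx, dy in NEIGHBORS:
            nx, ny = x + dx, y + dy
            if (0 <= nx < w and 0 <= ny < h and not mask[ny][nx]
                    and abs(image[ny][nx] - ref) <= tol):
                mask[ny][nx] = 1
                queue.append((nx, ny))
    return mask

def boundary_contrast(image, mask):
    """Invented confidence heuristic: mean intensity jump across the mask
    boundary, scaled to [0, 1]. A mask with no boundary scores 0."""
    h, w = len(image), len(image[0])
    diffs = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dx, dy in NEIGHBORS:
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h and not mask[ny][nx]:
                    diffs.append(abs(image[y][x] - image[ny][nx]))
    return sum(diffs) / (255 * len(diffs)) if diffs else 0.0

def segment(image, point=None, box=None, tolerances=(0, 20, 255)):
    """Return (mask, score) candidates for a point or box prompt, best first.
    point is (x, y); box is (x0, y0, x1, y1) with inclusive corners."""
    if point is None and box is None:
        raise ValueError("need a point or a box prompt")
    if point is None:
        point = ((box[0] + box[2]) // 2, (box[1] + box[3]) // 2)
    candidates = []
    for tol in tolerances:
        mask = flood_fill(image, point, tol)
        if box is not None:  # keep only pixels inside the box
            x0, y0, x1, y1 = box
            mask = [[v if x0 <= x <= x1 and y0 <= y <= y1 else 0
                     for x, v in enumerate(row)]
                    for y, row in enumerate(mask)]
        candidates.append((mask, boundary_contrast(image, mask)))
    return sorted(candidates, key=lambda c: c[1], reverse=True)

# A 6x6 grayscale image: a textured 3x3 object on a dark background.
image = [
    [10, 10, 10, 10, 10, 10],
    [10, 200, 210, 200, 10, 10],
    [10, 210, 200, 210, 10, 10],
    [10, 200, 210, 200, 10, 10],
    [10, 10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10, 10],
]

point_results = segment(image, point=(2, 2))      # click inside the object
box_results = segment(image, box=(1, 1, 3, 3))    # box around the object
best_mask, best_score = point_results[0]
best_area = sum(map(sum, best_mask))
```

In the real model, the candidates and their scores come from learned networks rather than hand-written rules; the sketch only mirrors the shape of the interface, where the caller supplies a prompt and keeps the highest-scoring mask.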
What is the Vision Transformer (ViT) and how does it apply attention to images?
Model Answer
ViT splits an image into fixed-size patches (e.g., 16×16 pixels), flattens and linearly embeds each one, adds position embeddings, and processes the result as a sequence of tokens with standard transformer self-attention. A special [CLS] token prepended to the sequence aggregates image information for classification. When pre-trained on large datasets (ImageNet-21K, JFT-300M), ViT matches or exceeds CNN performance; on smaller datasets it tends to lag behind CNNs because it lacks their built-in inductive biases. Advantages: captures long-range dependencies naturally (any patch can attend to any other, however distant) and scales well with data and compute. CLIP (OpenAI) offers ViT image encoders alongside ResNet variants. DINOv2 (Meta) is a strong self-supervised ViT.
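The patch-to-token pipeline above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a real ViT: the sizes (32×32 image, patch size 16, embedding width 8) are toy values, all weights are random rather than learned, the [CLS] token is a zero vector, and only one single-head attention layer is shown. The position embeddings added before attention are part of the standard ViT design; the helper names (`patchify`, `linear`, `self_attention`) are made up for this example.

```python
import math
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patches, row-major."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # all channels of this pixel
            patches.append(flat)
    return patches

def linear(vec, weight):
    """Multiply a length-d_in vector by a d_in x d_out weight matrix."""
    return [sum(v * row[j] for v, row in zip(vec, weight))
            for j in range(len(weight[0]))]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q = [linear(t, wq) for t in tokens]
    k = [linear(t, wk) for t in tokens]
    v = [linear(t, wv) for t in tokens]
    scale = math.sqrt(len(q[0]))
    outputs, weights = [], []
    for qi in q:
        # Every token attends to every token, including distant patches.
        attn = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        weights.append(attn)
        outputs.append([sum(a * vj[d] for a, vj in zip(attn, v))
                        for d in range(len(v[0]))])
    return outputs, weights

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
H, W, C, P, D = 32, 32, 3, 16, 8  # toy sizes, assumed for illustration

image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]

patches = patchify(image, P)                      # 4 patches of 16*16*3 = 768 values
w_embed = rand_matrix(rng, P * P * C, D)
tokens = [linear(p, w_embed) for p in patches]    # linear patch embedding

cls_token = [0.0] * D                             # learned in a real model
tokens = [cls_token] + tokens                     # prepend [CLS]

pos_embed = rand_matrix(rng, len(tokens), D)      # one position vector per token
tokens = [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]

outputs, attn_weights = self_attention(
    tokens, rand_matrix(rng, D, D), rand_matrix(rng, D, D), rand_matrix(rng, D, D)
)
cls_output = outputs[0]  # the vector a classification head would read
```

Each row of `attn_weights` is a probability distribution over all tokens, which is what lets the [CLS] position gather information from every patch in a single layer.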