CLIP for Dense Tasks
Last updated on February 8, 2024
Extract Free Dense Labels from CLIP (MaskCLIP)
(ECCV 2022 Oral)
Classify each image patch by taking the dot product between its patch embedding and the text embeddings.
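A minimal PyTorch sketch of this per-patch classification, assuming the patch features are already projected into CLIP's joint image-text space (function and tensor names are illustrative, not MaskCLIP's code):

```python
import torch
import torch.nn.functional as F

def classify_patches(patch_feats, text_embeds, h, w, tau=100.0):
    """Per-patch open-vocabulary classification, MaskCLIP-style.

    patch_feats: (N, D) patch embeddings, assumed already projected into
                 CLIP's joint image-text space (N = h * w patches).
    text_embeds: (C, D) CLIP text embeddings of the C class prompts.
    Returns an (h, w) map of predicted class indices.
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = tau * patch_feats @ text_embeds.t()  # (N, C) cosine logits
    return logits.argmax(dim=-1).reshape(h, w)    # dense label map

# toy usage: 14x14 patch grid, 512-d embeddings, 20 classes
seg = classify_patches(torch.randn(196, 512), torch.randn(20, 512), 14, 14)
```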
Perceptual Grouping in Contrastive Vision-Language Models (CLIPpy)
(ICCV 2023)
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning (PACL)
(CVPR 2023 Highlight)
Observation
- alignment between image and text at a patch level does not necessarily exist
- CLIP’s vision encoders outperform DINO on semantic coherence (semantically similar regions in images should produce similar patch representations in the vision encoder)
Method
Patch Aligned Contrastive Learning: take a weighted sum over vision patch embeddings where the weights are obtained from the patch-level similarities with the text embedding.
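A minimal PyTorch sketch of this pooling, assuming the patch-text similarities are softmax-normalized over patches to form the weights (names are illustrative):

```python
import torch
import torch.nn.functional as F

def pacl_pooled_embedding(patch_embeds, text_embed):
    """Patch Aligned pooling (sketch): weight each vision patch by its
    similarity to the text embedding, then take the weighted sum.

    patch_embeds: (B, N, D) patch embeddings from the vision embedder
    text_embed:   (B, D) text embedding of the paired caption
    Returns (B, D) image embeddings for the usual contrastive loss.
    """
    p = F.normalize(patch_embeds, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    sim = torch.einsum('bnd,bd->bn', p, t)  # patch-level similarities
    w = sim.softmax(dim=-1)                 # weights over patches (assumed softmax)
    return torch.einsum('bn,bnd->bd', w, patch_embeds)
```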
Experiments
Only the small vision embedder is trained, on ~30M image-text pairs.
Zero-shot Semantic Segmentation:
- Stride trick at inference: reducing the stride of the convolutional layer that extracts image patches in the ViT yields finer-grained patches at inference time (see the sketch below).
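A PyTorch sketch of the stride trick, reusing the pretrained patch-embedding convolution with a smaller stride (the function name and stride value are illustrative):

```python
import torch
import torch.nn.functional as F

def dense_patch_embed(conv, image, stride=8):
    """Stride trick (sketch): run the ViT patch-embedding conv with a
    smaller stride so patches overlap, giving a finer feature grid.

    conv:  the pretrained nn.Conv2d that embeds 16x16 patches (stride 16)
    image: (B, 3, H, W) input batch
    """
    feats = F.conv2d(image, conv.weight, conv.bias,
                     stride=stride, padding=conv.padding)
    return feats.flatten(2).transpose(1, 2)  # (B, N', D) with a denser grid
```

Note that the position embeddings must then be interpolated to the new, denser grid before running the rest of the ViT.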
Interesting findings:
- CLIP B/16 > CLIP L/14 > DINO B/16
- PACL's performance correlates positively with the model's semantic coherence
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation
(ICML 2023)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
(ICLR 2024 Spotlight)
Observation
Comparing the classification accuracy of CLIP ResNet vs. ViT, using image crops vs. dense features, shows:
- For whole images and cropped images: ViT > ResNet.
- With representations pooled from the feature map, ViT drops sharply while ResNet is barely affected.
K-Means clustering and visualization of the feature maps show that the ViT's are noticeably messier.
Possible reasons:
- ViT lacks local inductive bias, hindering the smooth transfer from representing pixels of a whole image to representing pixels of a local image region.
- each spot on the dense feature map of the CLIP ViT tends to encode the global image.
This is verified by a retrieval experiment on images and regions: the dense features match their corresponding whole images well, indicating that each location on the dense feature map tends to encode a global image representation (a sketch of this check follows).
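A rough PyTorch sketch of such a check, using mean-pooled dense features as queries against a gallery of global image embeddings (a simplification of the paper's setup; names are illustrative):

```python
import torch
import torch.nn.functional as F

def dense_to_image_retrieval(dense_feats, image_embeds):
    """Retrieval check (sketch): does the pooled dense feature of image i
    retrieve image i's own global embedding from the batch gallery?

    dense_feats:  (B, N, D) dense feature maps (N spatial locations)
    image_embeds: (B, D) pooled global image embeddings
    Returns top-1 retrieval accuracy over the batch.
    """
    q = F.normalize(dense_feats.mean(dim=1), dim=-1)  # pool locations
    g = F.normalize(image_embeds, dim=-1)
    top1 = (q @ g.t()).argmax(dim=-1)                 # nearest gallery image
    return (top1 == torch.arange(len(q))).float().mean()
```

High top-1 accuracy here supports the claim that the dense features carry mostly global image information.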
Method
The method is simple: since image crops classify better than dense features, distill the representations of image crops into the region representations pooled from the dense feature map!
- Image Patches as Regions: $m \times n$ patches.
- Self-Distillation: align the region representations pooled from dense feature maps to the image representations of the corresponding image crops.
$$
\mathcal{L} = \frac{1}{m \times n} \sum\limits_{i=0}^{m-1} \sum\limits_{j=0}^{n-1} \left(1-\frac{s_{\text{dense}}^{ij} \cdot t_{\text{image}}^{ij}}{\left\|s_{\text{dense}}^{ij}\right\| \cdot \left\|t_{\text{image}}^{ij}\right\|}\right)
$$
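A PyTorch sketch of this loss, assuming the $m \times n$ regions form a regular grid so that adaptive average pooling can stand in for region pooling (names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def clipself_loss(dense_map, crop_embeds, m, n):
    """CLIPSelf self-distillation loss (sketch): cosine distance between
    student region features pooled from the dense map (s_dense) and
    frozen-teacher embeddings of the corresponding image crops (t_image).

    dense_map:   (B, D, H, W) student dense feature map
    crop_embeds: (B, m*n, D) teacher embeddings of the m x n crops
    """
    # average-pool the dense map into the m x n region grid -> s_dense
    s = F.adaptive_avg_pool2d(dense_map, (m, n))       # (B, D, m, n)
    s = s.flatten(2).transpose(1, 2)                   # (B, m*n, D)
    cos = F.cosine_similarity(s, crop_embeds, dim=-1)  # (B, m*n)
    return (1.0 - cos).mean()                          # the loss above
```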
Experiments
Fine-tune CLIP on COCO.
K-Means visualization: after CLIPSelf fine-tuning, the ViT's dense features cluster into much cleaner semantic regions.
Open-Vocabulary Dense Prediction: used as the backbone for Open-Vocabulary Object Detection (F-VLM) / Semantic Segmentation (Cat-Seg) / Panoptic Segmentation (ODISE), it brings substantial gains.