CLIP for Dense Tasks
Last updated on February 8, 2024
Extract Free Dense Labels from CLIP (MaskCLIP)
(ECCV 2022 Oral)
Classify each image patch by taking the dot product between its patch embedding and the text embeddings.
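A minimal PyTorch sketch of this per-patch classification, assuming the patch features are already projected into CLIP's joint image-text space (function and tensor names are illustrative, not MaskCLIP's code):

```python
import torch
import torch.nn.functional as F

def classify_patches(patch_feats, text_embeds, h, w, tau=100.0):
    """Per-patch open-vocabulary classification, MaskCLIP-style.

    patch_feats: (N, D) patch embeddings, assumed already projected into
                 CLIP's joint image-text space (N = h * w patches).
    text_embeds: (C, D) CLIP text embeddings of the C class prompts.
    Returns an (h, w) map of predicted class indices.
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = tau * patch_feats @ text_embeds.t()  # (N, C) cosine logits
    return logits.argmax(dim=-1).reshape(h, w)    # dense label map

# toy usage: 14x14 patch grid, 512-d embeddings, 20 classes
seg = classify_patches(torch.randn(196, 512), torch.randn(20, 512), 14, 14)
```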
Perceptual Grouping in Contrastive Vision-Language Models (CLIPpy)
(ICCV 2023)
Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning (PACL)
(CVPR 2023 Highlight)
Observation
- alignment between image and text at a patch level does not necessarily exist
- CLIP’s vision encoders outperform DINO on semantic coherence (semantically similar regions in images should produce similar patch representations in the vision encoder)
Method
Patch Aligned Contrastive Learning: take a weighted sum over vision patch embeddings where the weights are obtained from the patch-level similarities with the text embedding.
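A minimal PyTorch sketch of this pooling, assuming the patch-text similarities are softmax-normalized over patches to form the weights (names are illustrative):

```python
import torch
import torch.nn.functional as F

def pacl_pooled_embedding(patch_embeds, text_embed):
    """Patch Aligned pooling (sketch): weight each vision patch by its
    similarity to the text embedding, then take the weighted sum.

    patch_embeds: (B, N, D) patch embeddings from the vision embedder
    text_embed:   (B, D) text embedding of the paired caption
    Returns (B, D) image embeddings for the usual contrastive loss.
    """
    p = F.normalize(patch_embeds, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    sim = torch.einsum('bnd,bd->bn', p, t)  # patch-level similarities
    w = sim.softmax(dim=-1)                 # weights over patches (assumed softmax)
    return torch.einsum('bn,bnd->bd', w, patch_embeds)
```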
Experiments
Only the small vision embedder is trained, on ~30M image-text pairs.
Zero-shot Semantic Segmentation:
- Stride trick at inference: reducing the stride of the convolutional layer that extracts image patches in the ViT yields finer-grained patches at inference time (see the sketch below).
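A PyTorch sketch of the stride trick, reusing the pretrained patch-embedding convolution with a smaller stride (the function name and stride value are illustrative):

```python
import torch
import torch.nn.functional as F

def dense_patch_embed(conv, image, stride=8):
    """Stride trick (sketch): run the ViT patch-embedding conv with a
    smaller stride so patches overlap, giving a finer feature grid.

    conv:  the pretrained nn.Conv2d that embeds 16x16 patches (stride 16)
    image: (B, 3, H, W) input batch
    """
    feats = F.conv2d(image, conv.weight, conv.bias,
                     stride=stride, padding=conv.padding)
    return feats.flatten(2).transpose(1, 2)  # (B, N', D) with a denser grid
```

Note that the position embeddings must then be interpolated to the new, denser grid before running the rest of the ViT.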
Interesting findings:
- CLIP B/16 > CLIP L/14 > DINO B/16
- PACL's performance correlates positively with the model's semantic coherence
SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation
(ICML 2023)
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
(ICLR 2024 Spotlight)
Observation
Comparing the classification accuracy of CLIP ResNet vs. ViT, using image crops vs. dense features, shows:
- For whole images and cropped images: ViT > ResNet.
- With representations pooled from the feature map, ViT drops sharply while ResNet is barely affected.
K-Means clustering and visualization of the feature maps show that the ViT's are noticeably messier.
Possible reasons:
- ViT lacks local inductive bias, hindering the smooth transfer from representing pixels of a whole image to representing pixels of a local image region.
- each spot on the dense feature map of the CLIP ViT tends to encode the global image.
This is verified by a retrieval experiment on images and regions: the dense features match their corresponding whole images well, indicating that each location on the dense feature map tends to encode a global image representation (a sketch of this check follows).
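A rough PyTorch sketch of such a check, using mean-pooled dense features as queries against a gallery of global image embeddings (a simplification of the paper's setup; names are illustrative):

```python
import torch
import torch.nn.functional as F

def dense_to_image_retrieval(dense_feats, image_embeds):
    """Retrieval check (sketch): does the pooled dense feature of image i
    retrieve image i's own global embedding from the batch gallery?

    dense_feats:  (B, N, D) dense feature maps (N spatial locations)
    image_embeds: (B, D) pooled global image embeddings
    Returns top-1 retrieval accuracy over the batch.
    """
    q = F.normalize(dense_feats.mean(dim=1), dim=-1)  # pool locations
    g = F.normalize(image_embeds, dim=-1)
    top1 = (q @ g.t()).argmax(dim=-1)                 # nearest gallery image
    return (top1 == torch.arange(len(q))).float().mean()
```

High top-1 accuracy here supports the claim that the dense features carry mostly global image information.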
Method
The method is simple: since image crops classify better than dense features, distill the representations of image crops into the region representations pooled from the dense feature map!
- Image Patches as Regions: $m \times n$ patches.
- Self-Distillation: align the region representations pooled from dense feature maps to the image representations of the corresponding image crops.
$$
\mathcal{L} = \frac{1}{m \times n} \sum\limits_{i=0}^{m-1} \sum\limits_{j=0}^{n-1} \left(1-\frac{s_{\text{dense}}^{ij} \cdot t_{\text{image}}^{ij}}{\left\|s_{\text{dense}}^{ij}\right\| \cdot \left\|t_{\text{image}}^{ij}\right\|}\right)
$$
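A PyTorch sketch of this loss, assuming the $m \times n$ regions form a regular grid so that adaptive average pooling can stand in for region pooling (names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F

def clipself_loss(dense_map, crop_embeds, m, n):
    """CLIPSelf self-distillation loss (sketch): cosine distance between
    student region features pooled from the dense map (s_dense) and
    frozen-teacher embeddings of the corresponding image crops (t_image).

    dense_map:   (B, D, H, W) student dense feature map
    crop_embeds: (B, m*n, D) teacher embeddings of the m x n crops
    """
    # average-pool the dense map into the m x n region grid -> s_dense
    s = F.adaptive_avg_pool2d(dense_map, (m, n))       # (B, D, m, n)
    s = s.flatten(2).transpose(1, 2)                   # (B, m*n, D)
    cos = F.cosine_similarity(s, crop_embeds, dim=-1)  # (B, m*n)
    return (1.0 - cos).mean()                          # the loss above
```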
Experiments
Fine-tune CLIP on COCO.
K-Means visualization: after CLIPSelf fine-tuning, the ViT's dense features cluster into much cleaner semantic regions.
Open-Vocabulary Dense Prediction: used as the backbone for Open-Vocabulary Object Detection (F-VLM) / Semantic Segmentation (Cat-Seg) / Panoptic Segmentation (ODISE), it brings substantial gains.