CLIP for Dense Tasks

Last updated on February 8, 2024


Extract Free Dense Labels from CLIP (MaskCLIP)

(ECCV 2022 Oral)

Classify each patch individually: take the dot product between the patch embedding and the text embeddings.
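
A minimal sketch of this patch-level classification (assuming the per-patch embeddings `patch_feats` have already been projected into CLIP's joint embedding space, e.g. via MaskCLIP's modified last attention layer; the function name and shapes are illustrative, not the official code):

```python
# Hedged sketch: classify every ViT patch against class text embeddings.
import torch
import torch.nn.functional as F

def dense_classify(patch_feats: torch.Tensor,  # (H*W, D) patch embeddings in CLIP space
                   text_embeds: torch.Tensor,  # (C, D) class-name text embeddings
                   h: int, w: int) -> torch.Tensor:
    # Normalize both sides so the dot product is a cosine similarity.
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = patch_feats @ text_embeds.t()      # (H*W, C) per-patch class scores
    return logits.argmax(dim=-1).view(h, w)     # coarse segmentation map
```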

Perceptual Grouping in Contrastive Vision-Language Models (CLIPpy)

(ICCV 2023)

Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning (PACL)

(CVPR 2023 Highlight)

Observation

  • CLIP is trained with an image-level contrastive objective, so patch-level alignment between image and text embeddings does not necessarily exist
  • CLIP’s vision encoders outperform DINO on semantic coherence (semantically similar regions in an image should produce similar patch representations in the vision encoder)

Method

Patch Aligned Contrastive Learning: take a weighted sum over the vision patch embeddings, where the weights come from the patch-level similarities with the text embedding; the pooled image embedding is then trained against the text embedding with the usual CLIP contrastive loss.
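
A rough sketch of this pooling, assuming patch and text embeddings have already been mapped to a shared space by PACL's trainable embedders; the softmax normalization and the names here are assumptions for illustration, not the paper's exact implementation:

```python
# Hedged sketch of patch-aligned pooling: weight patches by their similarity to
# the paired caption, then pool; the result feeds the CLIP contrastive loss.
import torch
import torch.nn.functional as F

def patch_aligned_pool(patch_embeds: torch.Tensor,  # (B, N, D) patch embeddings
                       text_embeds: torch.Tensor    # (B, D) paired caption embeddings
                       ) -> torch.Tensor:
    p = F.normalize(patch_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    sim = torch.einsum('bnd,bd->bn', p, t)          # patch-level similarities
    weights = sim.softmax(dim=-1)                   # (B, N) weights over patches (assumed normalization)
    pooled = torch.einsum('bn,bnd->bd', weights, patch_embeds)
    return pooled                                   # weighted-sum image embedding
```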

Experiments

  • Only the small vision embedder is trained (the CLIP encoders are kept frozen), on ~30M image-text pairs.

  • Zero-shot Semantic Segmentation:

    • Stride trick at inference: reducing the stride of the convolutional layer that extracts image patches in the ViT yields denser, more fine-grained patch features at inference time (see the sketch after this list).
  • Interesting findings:

    • CLIP B/16 > CLIP L/14 > DINO B/16
    • PACL performance correlates positively with the model’s semantic coherence
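
A hedged sketch of the stride trick, assuming a timm-style ViT where `patch_embed.proj` is the patch-extracting Conv2d and `pos_embed` includes a CLS token (these attribute names are assumptions about the implementation):

```python
# Reduce the patch-embedding stride at inference to get a denser feature grid.
import torch
import torch.nn.functional as F

def apply_stride_trick(vit, new_stride: int, img_size: int = 224):
    conv = vit.patch_embed.proj                  # Conv2d that extracts patches
    conv.stride = (new_stride, new_stride)       # smaller stride -> overlapping patches
    # The denser grid needs more positional embeddings: interpolate the old ones.
    old_grid = int((vit.pos_embed.shape[1] - 1) ** 0.5)           # exclude the CLS token
    new_grid = (img_size - conv.kernel_size[0]) // new_stride + 1
    cls_pos, patch_pos = vit.pos_embed[:, :1], vit.pos_embed[:, 1:]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, -1).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode='bicubic', align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, -1)
    vit.pos_embed = torch.nn.Parameter(torch.cat([cls_pos, patch_pos], dim=1))
    return vit
```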

SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation

(ICML 2023)

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

(ICLR 2024 Spotlight)

Observation

Comparing the classification accuracy of CLIP ResNet vs. ViT when classifying from image crops vs. dense features shows that:

  • Classifying whole images and cropped images: ViT > ResNet.
  • Classifying with representations pooled from the dense feature map: ViT degrades substantially, while ResNet is largely unaffected.

Clustering the feature maps with K-Means and visualizing the result shows that the ViT's dense features are rather messy.

Possible reasons:

  • ViT lacks local inductive bias, hindering the smooth transfer from representing pixels of a whole image to representing pixels of a local image region.
  • each spot on the dense feature map of the CLIP ViT tends to encode the global image.

This is verified by a retrieval experiment on images and regions: the dense features of regions match the corresponding whole images well, indicating that each location on the dense feature map tends to encode a global image representation.

Method

The method is simple: since image crops are classified better than dense features, distill the representations of image crops into the region representations pooled from the dense feature map!

  • Image Patches as Regions: split the image into an $m \times n$ grid of patch regions.
  • Self-Distillation: align the region representations pooled from dense feature maps to the image representations of the corresponding image crops.

$$
\mathcal{L} = \frac{1}{m \times n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left( 1 - \frac{s_{\text{dense}}^{ij} \cdot t_{\text{image}}^{ij}}{\left\| s_{\text{dense}}^{ij} \right\| \cdot \left\| t_{\text{image}}^{ij} \right\|} \right)
$$
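
A minimal sketch of this objective, assuming the crop embeddings come from the frozen teacher CLIP and the region embeddings are mean-pooled from the student's dense feature map over the same $m \times n$ grid (the mean pooling and the names here are assumptions for illustration):

```python
# Hedged sketch of the CLIPSelf-style self-distillation loss (1 - cosine per grid cell).
import torch
import torch.nn.functional as F

def clipself_loss(student_dense: torch.Tensor,  # (D, H, W) student dense feature map
                  crop_embeds: torch.Tensor,    # (m, n, D) teacher embeddings of the crops
                  m: int, n: int) -> torch.Tensor:
    D, H, W = student_dense.shape
    loss = student_dense.new_zeros(())
    for i in range(m):
        for j in range(n):
            # Pool the student features inside grid cell (i, j) to get s_dense^{ij}.
            hs, he = i * H // m, (i + 1) * H // m
            ws, we = j * W // n, (j + 1) * W // n
            s = student_dense[:, hs:he, ws:we].mean(dim=(1, 2))
            t = crop_embeds[i, j]                                # t_image^{ij}
            loss = loss + (1 - F.cosine_similarity(s, t, dim=0))
    return loss / (m * n)
```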

Experiments

  • fine-tune CLIP on COCO

  • K-Means visualization of the dense feature maps.

  • Open-Vocabulary Dense Prediction: used as the backbone for Open-Vocabulary Object Detection (F-VLM), Semantic Segmentation (Cat-Seg), and Panoptic Segmentation (ODISE), the fine-tuned model brings sizable gains.

