Fantastic DINO and Why They Pick It

Last updated on February 8, 2024


Emerging Properties in Self-Supervised Vision Transformers (DINO)

(ICCV 2021)

self-supervised ViT features contain explicit information about the semantic segmentation of an image:

  • explicit semantic information in attention maps
  • different heads can attend to different semantic regions of an image
  • thresholding a head's self-attention map to keep the top 60% of the attention mass yields a segmentation mask directly (see the sketch after this list); supervised ViTs do not show this nice property
  • compute self-attention from a few selected reference points to visualize what each head attends to
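
A minimal sketch of the 60%-mass thresholding, using the official DINO hub model (`get_last_selfattention` comes from the DINO codebase; real inputs need the usual resize/ImageNet normalization, omitted here):

```python
import torch

# Official DINO ViT-S/8 via torch.hub; get_last_selfattention is
# provided by the DINO repo's VisionTransformer.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
model.eval()

img = torch.randn(1, 3, 224, 224)             # stand-in for a normalized image
with torch.no_grad():
    attn = model.get_last_selfattention(img)  # (1, heads, N+1, N+1)

n_heads = attn.shape[1]
w = h = 224 // 8                              # 28 x 28 patch grid
cls_attn = attn[0, :, 0, 1:]                  # [CLS] attention over patches

# Keep, per head, the smallest set of patches holding 60% of the mass.
vals, idx = torch.sort(cls_attn, dim=1, descending=True)
cum = torch.cumsum(vals, dim=1) / vals.sum(dim=1, keepdim=True)
masks = torch.zeros_like(cls_attn)
masks.scatter_(1, idx, (cum <= 0.6).float())
masks = masks.reshape(n_heads, w, h)          # one binary mask per head
```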

Localizing Objects with Self-Supervised Transformers and no Labels (LOST)

(BMVC 2021)

Object Discovery

leverage high-quality features obtained from DINO:

  • use the key component of the last attention layer for computing the similarities between the different patches
  • localize a part of an object by selecting the seed patch with the smallest number of similar patches, assuming an object covers less of the image than the background (see the sketch below)
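
A minimal sketch of the seed-selection idea, assuming `keys` holds the key features of the patch tokens (extracted e.g. with a forward hook, not shown); LOST's actual seed expansion and box extraction are more involved:

```python
import torch

def lost_seed_and_mask(keys: torch.Tensor):
    """Seed selection in the spirit of LOST.
    keys: (N, d) key features of the patch tokens from the
    last attention layer."""
    sim = keys @ keys.T                 # patch-to-patch similarities
    # Objects are assumed to cover less area than the background, so
    # object patches correlate positively with the fewest patches.
    pos = (sim >= 0).sum(dim=-1)
    seed = int(pos.argmin())
    # Crude object mask: patches positively correlated with the seed.
    return seed, sim[seed] >= 0
```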

Unsupervised Object Localization: Observing the Background to Discover Objects (FOUND)

(CVPR 2023)

Foreground/background segmentation

leverage attention maps in DINO:

  • select one of the patches that received the least attention
    • some heads are noisy $\Rightarrow$ down-weight noisy attention maps based on a sparsity criterion
  • the background mask then collects the patches similar to this mined one (sketched below)
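
An illustrative sketch of the background mining; the sparsity weighting here is a stand-in for the paper's exact formulation, and `tau` is a hypothetical similarity threshold:

```python
import torch
import torch.nn.functional as F

def found_background_mask(cls_attn: torch.Tensor,
                          feats: torch.Tensor,
                          tau: float = 0.3) -> torch.Tensor:
    """cls_attn: (heads, N) [CLS] attention over patches,
    feats: (N, d) patch features (e.g. DINO keys)."""
    # Down-weight noisy heads: a head spreading its mass over many
    # patches is treated as less reliable (the sparsity idea).
    spread = (cls_attn > cls_attn.mean(dim=1, keepdim=True)).sum(dim=1)
    w = 1.0 / spread.float()
    w = w / w.sum()
    attn = (w[:, None] * cls_attn).sum(dim=0)   # (N,) weighted attention
    seed = attn.argmin()                        # least-attended patch
    # Background mask: patches whose features resemble the seed's.
    f = F.normalize(feats, dim=-1)
    return (f @ f[seed]) > tau
```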

Bridging the Gap to Real-World Object-Centric Learning (DINOSAUR)

(ICLR 2023)

train an object-centric model (Slot Attention) to reconstruct frozen DINO features; because these features are highly homogeneous within objects, the reconstruction objective pushes slots to bind to objects (minimal loss sketch below)
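A sketch of the training objective only; the Slot Attention encoder/decoder producing `slot_recon` is omitted:

```python
import torch.nn.functional as F

def dinosaur_loss(dino_feats, slot_recon):
    """dino_feats, slot_recon: (B, N, d). The object-centric model
    decodes its slots into per-patch features and is trained to
    match the frozen DINO features."""
    return F.mse_loss(slot_recon, dino_feats.detach())
```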

Unsupervised Semantic Segmentation with Self-supervised Object-centric Representations

(ICLR 2023 notable top 25%)

  • DINO's attention maps alone are not reliable enough across a broad enough set of images to kickstart unsupervised semantic segmentation
  • but the features learned within an object region yield clusters of surprisingly high purity that align well with the underlying object categories (see the clustering sketch below)
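
A hypothetical sketch of that observation: pool one feature per discovered object region across a dataset, then cluster (random stand-in data here; feature dimension and cluster count are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical input: one pooled DINO feature per discovered object
# region, aggregated over a dataset (random stand-in values here).
region_feats = np.random.randn(1000, 384).astype(np.float32)

# Cluster region features into pseudo-categories; the paper's finding
# is that such clusters are surprisingly pure w.r.t. object classes.
pseudo_labels = KMeans(n_clusters=21, n_init=10).fit_predict(region_feats)
```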

CLIP-DINOiser: Teaching CLIP a few DINO tricks

integrate localization priors extracted from DINO:

  • use the value embeddings, whose correlations are finer-grained than those of the keys
  • DINO features are more densely and accurately correlated than CLIP's (sketched below)
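
A rough sketch of the guidance idea, assuming `dino_feats` are DINO value embeddings and `clip_feats` are dense (MaskCLIP-style) CLIP patch features; the paper's actual pooling details differ, and `tau` is a hypothetical threshold:

```python
import torch
import torch.nn.functional as F

def dino_guided_pooling(clip_feats: torch.Tensor,
                        dino_feats: torch.Tensor,
                        tau: float = 0.2) -> torch.Tensor:
    """clip_feats: (N, d_clip) dense CLIP patch features,
    dino_feats: (N, d_dino) DINO value embeddings."""
    d = F.normalize(dino_feats, dim=-1)
    corr = d @ d.T                           # DINO patch correlations
    corr = torch.where(corr > tau, corr, torch.zeros_like(corr))
    corr = corr / corr.sum(dim=-1, keepdim=True)
    # Smooth CLIP features with the (cleaner) DINO correlation map.
    return corr @ clip_feats
```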

Vision Transformers Need Registers

(ICLR 2024 Oral)

Investigates why artifacts (high-norm outlier tokens) emerge in the attention maps of ViTs other than the original DINO; the fix proposed by the title is to append extra learnable register tokens to the input sequence (minimal sketch below).
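
A minimal sketch of the register idea as a hypothetical wrapper (real implementations insert the registers inside the ViT, next to the [CLS] token):

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Append a few learnable 'register' tokens to the sequence; they
    join the attention computation as a scratch space but are dropped
    at the output, so patch tokens no longer get hijacked."""
    def __init__(self, backbone: nn.Module, dim: int, num_registers: int = 4):
        super().__init__()
        self.backbone = backbone  # any module over (B, N, dim) sequences
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        reg = self.registers.expand(tokens.shape[0], -1, -1)
        x = torch.cat([tokens, reg], dim=1)
        x = self.backbone(x)
        return x[:, : tokens.shape[1]]       # discard register outputs
```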

