Fantastic DINO and Why They Pick It
Last updated on February 8, 2024
Emerging Properties in Self-Supervised Vision Transformers (DINO)
(ICCV 2021)
self-supervised ViT features contain explicit information about the semantic segmentation of an image:
- explicit semantic information in attention maps
- different heads can attend to different semantic regions of an image
- segment directly by keeping the patches that account for the top 60% of the attention mass; supervised ViTs lack this property
- compute self-attention for a few selected reference points
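The thresholding trick above can be sketched in a few lines: sort patches by their [CLS]-to-patch attention, then keep the smallest set covering 60% of the total mass. The `attention_mask` helper and the toy 8-patch attention vector are hypothetical, just illustrating the procedure DINO uses for its visualizations.

```python
import numpy as np

def attention_mask(attn, keep=0.6):
    """Keep the smallest set of patches whose attention sums to `keep`
    of the total mass (hypothetical helper mirroring DINO's procedure)."""
    order = np.argsort(attn)[::-1]                 # patches, most-attended first
    csum = np.cumsum(attn[order]) / attn.sum()     # cumulative attention mass
    k = int(np.searchsorted(csum, keep)) + 1       # patches needed to reach `keep`
    mask = np.zeros_like(attn, dtype=bool)
    mask[order[:k]] = True
    return mask

# toy [CLS]-to-patch attention over 8 patches (sums to 1)
attn = np.array([0.4, 0.25, 0.1, 0.08, 0.07, 0.05, 0.03, 0.02])
mask = attention_mask(attn, keep=0.6)  # the two most-attended patches suffice
```

Binarizing per head this way is what produces the object-shaped masks shown in the paper.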
Localizing Objects with Self-Supervised Transformers and no Labels (LOST)
(BMVC 2021)
Object Discovery
leverage high-quality features obtained from DINO:
- use the keys of the last attention layer to compute similarities between patches
- localize a part of an object by selecting the patch with the fewest similar patches (the seed)
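LOST's seed selection rests on an inverse-frequency argument: objects occupy less area than the background, so an object patch has fewer positively correlated patches. A minimal sketch on raw key features (the `lost_seed` helper and the toy key matrix are assumptions for illustration):

```python
import numpy as np

def lost_seed(keys):
    """Pick the patch whose key feature is positively correlated with
    the fewest other patches (sketch of LOST's seed-selection step)."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = k @ k.T                    # patch-to-patch key similarity
    degree = (sim > 0).sum(axis=1)   # how many patches each patch resembles
    return int(np.argmin(degree))    # the rarest patch is the object seed

# toy keys: four background-like patches, two object patches
keys = np.array([
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],   # background cluster
    [-1.0, 0.0], [-0.9, -0.1],                        # smaller object cluster
])
seed = lost_seed(keys)  # an index in the object cluster
```

The full method then expands the seed into a box by adding patches whose keys correlate positively with it.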
Unsupervised Object Localization: Observing the Background to Discover Objects (FOUND)
(CVPR 2023)
foreground/background segmentation
leverage attention maps in DINO:
- select one of the patches that received the least attention
- some heads are noisy $\Rightarrow$ reduce the effect of noisy attention maps based on the sparsity concept
- the background mask incorporates patches similar to this mined one
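The three bullets above can be composed into one pipeline: fuse heads with sparsity-based weights, take the least-attended patch as a background seed, and grow the background mask by feature similarity. This is a hypothetical simplification; the helper name, the sparsity weighting formula, and the threshold `tau` are assumptions, not FOUND's exact equations.

```python
import numpy as np

def found_background_mask(head_attn, feats, tau=0.5):
    """Sketch of FOUND-style background mining: down-weight noisy
    (non-sparse) heads, seed on the least-attended patch, then mask
    every patch similar to that seed."""
    # a head whose attention concentrates on few patches gets more weight
    sparsity = 1.0 / (head_attn > head_attn.mean(axis=1, keepdims=True)).sum(axis=1)
    attn = (sparsity[:, None] * head_attn).sum(axis=0)  # weighted head fusion
    seed = int(np.argmin(attn))                         # least-attended patch
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f[seed]                                   # similarity to the seed
    return sim > tau                                    # background mask

# toy example: 2 heads over 5 patches; the last two patches are background-like
head_attn = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],
                      [0.3, 0.3, 0.2, 0.1, 0.1]])
feats = np.array([[1.0, 0.0], [1.0, 0.1], [0.9, 0.0],
                  [0.0, 1.0], [0.1, 1.0]])
mask = found_background_mask(head_attn, feats)
```

Inverting the resulting background mask gives the foreground, which is why the method needs no labels at any stage.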
Bridging the Gap to Real-World Object-Centric Learning (DINOSAUR)
(ICLR 2023)
reconstruct DINO features that have a high level of homogeneity within objects
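The training signal here is, in essence, a regression onto frozen DINO patch features instead of raw pixels; because those features are homogeneous within an object, the reconstruction bottleneck groups patches into object slots. A minimal sketch of that loss (the function name is an assumption; DINOSAUR's decoder and slot-attention machinery are omitted):

```python
import numpy as np

def feature_reconstruction_loss(decoded, dino_feats):
    """Mean squared error between the decoder's per-patch reconstruction
    and frozen DINO patch features (DINOSAUR's target, in essence)."""
    return float(((decoded - dino_feats) ** 2).mean())

# toy check: 4 patches with 3-dim features
target = np.ones((4, 3))
loss_zero = feature_reconstruction_loss(target, target)        # perfect reconstruction
loss_bad = feature_reconstruction_loss(np.zeros((4, 3)), target)
```

Swapping the pixel target for feature space is the "bridge" in the title: pixel reconstruction works on toy scenes but not real images.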
Unsupervised Semantic Segmentation with Self-supervised Object-centric Representations
(ICLR 2023 notable top 25%)
- the attention maps of their DINO approach are not strong enough on a broad enough set of images to kickstart unsupervised semantic segmentation
- but their learned features within an object region yield clusters of surprisingly high purity and align well with underlying object categories
CLIP-DINOiser: Teaching CLIP a few DINO tricks
integrate localization priors extracted from DINO:
- the value embeddings exhibit finer-grained correlations than the key embeddings
- DINO features are more densely and accurately correlated than those of CLIP
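One way to read these two bullets together: build a patch-to-patch correlation matrix from DINO value embeddings and use it to smooth noisy dense CLIP features, so correlated patches share predictions. The `dino_guided_pooling` helper below is a hypothetical sketch of that idea, not CLIP-DINOiser's exact formulation.

```python
import numpy as np

def dino_guided_pooling(clip_feats, dino_vals, tau=0.0):
    """Smooth dense CLIP features with a correlation matrix computed
    from DINO value embeddings (sketch; names and `tau` are assumptions)."""
    v = dino_vals / np.linalg.norm(dino_vals, axis=1, keepdims=True)
    corr = v @ v.T                                  # patch-to-patch correlation
    corr = np.where(corr > tau, corr, 0.0)          # keep positive correlations
    corr = corr / corr.sum(axis=1, keepdims=True)   # row-normalize into weights
    return corr @ clip_feats                        # correlation-weighted pooling

# toy: two DINO clusters; CLIP features get averaged within each cluster
dino_vals = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
clip_feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
out = dino_guided_pooling(clip_feats, dino_vals)
```

After pooling, same-cluster patches carry identical averaged CLIP features, which is exactly the denoising effect the paper exploits for open-vocabulary segmentation.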
Vision Transformers Need Registers
(ICLR 2024 Oral)
investigates why artifacts emerge in the attention maps of ViTs other than DINO
https://sydcs.github.io/2024/02/08/DINO/