Fantastic DINO and Why They Pick It
Last updated on February 8, 2024
Emerging Properties in Self-Supervised Vision Transformers (DINO)
(ICCV 2021)
self-supervised ViT features contain explicit information about the semantic segmentation of an image:
- explicit semantic information in attention maps
- different heads can attend to different semantic regions of an image
- segment directly by keeping the patches that account for the top 60% of the attention mass; supervised ViTs lack this property
- compute self-attention for a few selected reference points
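The thresholding trick above can be sketched in a few lines: sort patches by their [CLS]-to-patch attention, then keep the smallest set covering 60% of the total mass. The `attention_mask` helper and the toy 8-patch attention vector are hypothetical, just illustrating the procedure DINO uses for its visualizations.

```python
import numpy as np

def attention_mask(attn, keep=0.6):
    """Keep the smallest set of patches whose attention sums to `keep`
    of the total mass (hypothetical helper mirroring DINO's procedure)."""
    order = np.argsort(attn)[::-1]                 # patches, most-attended first
    csum = np.cumsum(attn[order]) / attn.sum()     # cumulative attention mass
    k = int(np.searchsorted(csum, keep)) + 1       # patches needed to reach `keep`
    mask = np.zeros_like(attn, dtype=bool)
    mask[order[:k]] = True
    return mask

# toy [CLS]-to-patch attention over 8 patches (sums to 1)
attn = np.array([0.4, 0.25, 0.1, 0.08, 0.07, 0.05, 0.03, 0.02])
mask = attention_mask(attn, keep=0.6)  # the two most-attended patches suffice
```

Binarizing per head this way is what produces the object-shaped masks shown in the paper.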
Localizing Objects with Self-Supervised Transformers and no Labels (LOST)
(BMVC 2021)
Object Discovery
leverage high-quality features obtained from DINO:
- use the keys of the last attention layer to compute similarities between patches
- localize a part of an object by selecting the patch with the fewest similar patches (the seed)
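LOST's seed selection rests on an inverse-frequency argument: objects occupy less area than the background, so an object patch has fewer positively correlated patches. A minimal sketch on raw key features (the `lost_seed` helper and the toy key matrix are assumptions for illustration):

```python
import numpy as np

def lost_seed(keys):
    """Pick the patch whose key feature is positively correlated with
    the fewest other patches (sketch of LOST's seed-selection step)."""
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sim = k @ k.T                    # patch-to-patch key similarity
    degree = (sim > 0).sum(axis=1)   # how many patches each patch resembles
    return int(np.argmin(degree))    # the rarest patch is the object seed

# toy keys: four background-like patches, two object patches
keys = np.array([
    [1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [0.8, 0.0],   # background cluster
    [-1.0, 0.0], [-0.9, -0.1],                        # smaller object cluster
])
seed = lost_seed(keys)  # an index in the object cluster
```

The full method then expands the seed into a box by adding patches whose keys correlate positively with it.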
Unsupervised Object Localization: Observing the Background to Discover Objects (FOUND)
(CVPR 2023)
foreground/background segmentation
leverage attention maps in DINO:
- select one of the patches that received the least attention
- some heads are noisy $\Rightarrow$ reduce the effect of noisy attention maps based on the sparsity concept
- the background mask incorporates patches similar to this mined one
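The three bullets above can be composed into one pipeline: fuse heads with sparsity-based weights, take the least-attended patch as a background seed, and grow the background mask by feature similarity. This is a hypothetical simplification; the helper name, the sparsity weighting formula, and the threshold `tau` are assumptions, not FOUND's exact equations.

```python
import numpy as np

def found_background_mask(head_attn, feats, tau=0.5):
    """Sketch of FOUND-style background mining: down-weight noisy
    (non-sparse) heads, seed on the least-attended patch, then mask
    every patch similar to that seed."""
    # a head whose attention concentrates on few patches gets more weight
    sparsity = 1.0 / (head_attn > head_attn.mean(axis=1, keepdims=True)).sum(axis=1)
    attn = (sparsity[:, None] * head_attn).sum(axis=0)  # weighted head fusion
    seed = int(np.argmin(attn))                         # least-attended patch
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f[seed]                                   # similarity to the seed
    return sim > tau                                    # background mask

# toy example: 2 heads over 5 patches; the last two patches are background-like
head_attn = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],
                      [0.3, 0.3, 0.2, 0.1, 0.1]])
feats = np.array([[1.0, 0.0], [1.0, 0.1], [0.9, 0.0],
                  [0.0, 1.0], [0.1, 1.0]])
mask = found_background_mask(head_attn, feats)
```

Inverting the resulting background mask gives the foreground, which is why the method needs no labels at any stage.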
Bridging the Gap to Real-World Object-Centric Learning (DINOSAUR)
(ICLR 2023)
reconstruct DINO features that have a high level of homogeneity within objects
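The training signal here is, in essence, a regression onto frozen DINO patch features instead of raw pixels; because those features are homogeneous within an object, the reconstruction bottleneck groups patches into object slots. A minimal sketch of that loss (the function name is an assumption; DINOSAUR's decoder and slot-attention machinery are omitted):

```python
import numpy as np

def feature_reconstruction_loss(decoded, dino_feats):
    """Mean squared error between the decoder's per-patch reconstruction
    and frozen DINO patch features (DINOSAUR's target, in essence)."""
    return float(((decoded - dino_feats) ** 2).mean())

# toy check: 4 patches with 3-dim features
target = np.ones((4, 3))
loss_zero = feature_reconstruction_loss(target, target)        # perfect reconstruction
loss_bad = feature_reconstruction_loss(np.zeros((4, 3)), target)
```

Swapping the pixel target for feature space is the "bridge" in the title: pixel reconstruction works on toy scenes but not real images.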
Unsupervised Semantic Segmentation with Self-supervised Object-centric Representations
(ICLR 2023 notable top 25%)
- the attention maps of their DINO approach are not strong enough on a broad enough set of images to kickstart unsupervised semantic segmentation
- but their learned features within an object region yield clusters of surprisingly high purity and align well with underlying object categories
CLIP-DINOiser: Teaching CLIP a few DINO tricks
integrate localization priors extracted from DINO:
- the value embeddings exhibit finer-grained correlations than the key embeddings
- DINO features are more densely and accurately correlated than those of CLIP
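One way to read these two bullets together: build a patch-to-patch correlation matrix from DINO value embeddings and use it to smooth noisy dense CLIP features, so correlated patches share predictions. The `dino_guided_pooling` helper below is a hypothetical sketch of that idea, not CLIP-DINOiser's exact formulation.

```python
import numpy as np

def dino_guided_pooling(clip_feats, dino_vals, tau=0.0):
    """Smooth dense CLIP features with a correlation matrix computed
    from DINO value embeddings (sketch; names and `tau` are assumptions)."""
    v = dino_vals / np.linalg.norm(dino_vals, axis=1, keepdims=True)
    corr = v @ v.T                                  # patch-to-patch correlation
    corr = np.where(corr > tau, corr, 0.0)          # keep positive correlations
    corr = corr / corr.sum(axis=1, keepdims=True)   # row-normalize into weights
    return corr @ clip_feats                        # correlation-weighted pooling

# toy: two DINO clusters; CLIP features get averaged within each cluster
dino_vals = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
clip_feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
out = dino_guided_pooling(clip_feats, dino_vals)
```

After pooling, same-cluster patches carry identical averaged CLIP features, which is exactly the denoising effect the paper exploits for open-vocabulary segmentation.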
Vision Transformers Need Registers
(ICLR 2024 Oral)
investigates why artifacts emerge in the attention maps of ViTs other than DINO
https://sydcs.github.io/2024/02/08/DINO/