SelfPatch
Patch-level Representation Learning for Self-supervised Vision Transformers (SelfPatch)
PyTorch implementation of "Patch-level Representation Learning for Self-supervised Vision Transformers" (accepted as an oral presentation at CVPR 2022).
Requirements
- torch==1.7.0
- torchvision==0.8.1
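Before launching a long pretraining run, it can help to confirm that the installed packages match the pins above. The following is a minimal sketch, not part of the repository: `matches_pin` and `check_pins` are hypothetical helper names, and the comparison assumes that a local build suffix such as `+cu110` should be ignored.

```python
from importlib import metadata

# Pins copied from the Requirements list above.
PINS = {"torch": "1.7.0", "torchvision": "0.8.1"}


def matches_pin(installed: str, pinned: str) -> bool:
    """Compare versions, ignoring a local build suffix such as '+cu110' (assumption)."""
    return installed.split("+", 1)[0] == pinned


def check_pins(pins=PINS):
    """Return {package: (installed_version_or_None, matches_pin)} for each pinned package."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        ok = installed is not None and matches_pin(installed, pinned)
        report[name] = (installed, ok)
    return report
```

Calling `check_pins()` in the target environment returns one entry per pinned package, so a mismatch can be caught before any GPU time is spent.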
Pretraining on ImageNet
python -m torch.distributed.launch --nproc_per_node=8 main_selfpatch.py --arch vit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir --local_crops_number 8 --patch_size 16 --batch_size_per_gpu 128 --out_dim_selfpatch 4096 --k_num 4
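To make the scale of this command concrete, the sketch below derives a few quantities from its flags. It is only an illustration under stated assumptions: the `PretrainConfig` class is hypothetical, the run is assumed to use a single node, and the two global crops at 224×224 are assumed from the DINO convention rather than taken from the command itself.

```python
from dataclasses import dataclass


@dataclass
class PretrainConfig:
    nproc_per_node: int = 8        # --nproc_per_node (one process per GPU)
    batch_size_per_gpu: int = 128  # --batch_size_per_gpu
    local_crops_number: int = 8    # --local_crops_number
    patch_size: int = 16           # --patch_size
    global_crops_number: int = 2   # ASSUMPTION: DINO-style two global crops
    global_crop_size: int = 224    # ASSUMPTION: not set by the command above

    @property
    def global_batch_size(self) -> int:
        """Images per optimization step across all GPUs on one node."""
        return self.nproc_per_node * self.batch_size_per_gpu

    @property
    def views_per_image(self) -> int:
        """Augmented crops generated for each image."""
        return self.global_crops_number + self.local_crops_number

    @property
    def patches_per_global_crop(self) -> int:
        """Number of patch tokens a ViT sees for one global crop."""
        side = self.global_crop_size // self.patch_size
        return side * side


cfg = PretrainConfig()
print(cfg.global_batch_size)        # 1024
print(cfg.views_per_image)          # 10
print(cfg.patches_per_global_crop)  # 196
```

When running on fewer GPUs, the global batch size shrinks proportionally unless `--batch_size_per_gpu` is raised to compensate.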
Pretrained weights on ImageNet
You can download the weights of the models pretrained on ImageNet. All models use the ViT-S/16 architecture. For the detection and segmentation downstream tasks, see SelfPatch/detection and SelfPatch/segmentation.
backbone | arch | checkpoint |
---|---|---|
DINO | ViT-S/16 | download (pretrained model from VISSL) |
DINO + SelfPatch | ViT-S/16 | download |
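A downloaded checkpoint usually needs its parameter names cleaned before it can be loaded into a bare ViT-S/16 backbone. The sketch below shows one way to do that. The `clean_state_dict` helper is hypothetical, and the `teacher` key and the `module.` / `backbone.` prefixes are assumptions borrowed from the DINO checkpoint convention; they have not been verified against the files linked above.

```python
def clean_state_dict(checkpoint, checkpoint_key="teacher",
                     prefixes=("module.", "backbone.")):
    """Extract a sub-dict (if present) and strip wrapper prefixes from its keys.

    ASSUMPTION: the key and prefix names follow the DINO convention.
    """
    # Fall back to the checkpoint itself when it is already a flat state dict.
    state = checkpoint.get(checkpoint_key, checkpoint)
    cleaned = {}
    for key, value in state.items():
        # Prefixes are stripped in order, so "module.backbone.x" becomes "x".
        for prefix in prefixes:
            if key.startswith(prefix):
                key = key[len(prefix):]
        cleaned[key] = value
    return cleaned


# Typical use with PyTorch (requires torch and a model object; not run here):
#   checkpoint = torch.load("/path/to/checkpoint.pth", map_location="cpu")
#   msg = model.load_state_dict(clean_state_dict(checkpoint), strict=False)
#   print(msg)  # lists missing and unexpected keys
```

Loading with `strict=False` and inspecting the returned message is a cautious choice here: it shows which keys failed to match instead of silently assuming the prefixes were right.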
Evaluating video object segmentation on the DAVIS 2017 dataset
Step 1. Prepare DAVIS 2017 data
cd $HOME
git clone https://github.com/davisvideochallenge/davis-2017
cd davis-2017
./data/get_davis.sh
Step 2. Run video object segmentation
python eval_video_segmentation.py --data_path /path/to/davis-2017/DAVIS/ --output_dir /path/to/saving_dir --pretrained_weights /path/to/model_dir --arch vit_small --patch_size 16
Step 3. Evaluate the obtained segmentation
git clone https://github.com/davisvideochallenge/davis2017-evaluation $HOME/davis2017-evaluation
python /path/to/davis2017-evaluation/evaluation_method.py --task semi-supervised --davis_path /path/to/davis-2017/DAVIS --results_path /path/to/saving_dir
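The DAVIS semi-supervised benchmark scores predictions by region similarity (J, the intersection over union of masks) and boundary accuracy (F). As a rough illustration of the J measure only, and not the evaluation package's own code, the sketch below computes the IoU of two binary masks given as nested lists. The `jaccard` name and the choice to return 1.0 when both masks are empty are assumptions.

```python
def jaccard(pred, target):
    """Intersection over union of two same-shaped binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # ASSUMPTION: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


pred = [[1, 1],
        [0, 0]]
target = [[1, 0],
          [1, 0]]
print(jaccard(pred, target))  # one shared pixel out of three covered: 0.333...
```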
Video object segmentation examples on the DAVIS 2017 dataset
Video (left), DINO (middle) and our SelfPatch (right)
Acknowledgement
Our code base builds partly on the following packages: DINO, mmdetection, mmsegmentation, and XCiT.
Citation
If you use this code for your research, please cite our paper.
@InProceedings{Yun_2022_CVPR,
author = {Yun, Sukmin and Lee, Hankook and Kim, Jaehyung and Shin, Jinwoo},
title = {Patch-Level Representation Learning for Self-Supervised Vision Transformers},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022},
pages = {8354-8363}
}