OpenSSL-SimCore (CVPR 2023)
Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning
Sungnyun Kim*,
Sangmin Bae*,
Se-Young Yun
* equal contribution
- Open-set Self-Supervised Learning (OpenSSL) task: an unlabeled open-set is available during the pretraining phase on the fine-grained target dataset.
- SimCore: a simple coreset selection algorithm that leverages an open-set subset semantically similar to the target dataset.
- SimCore significantly improves representation learning performance on various downstream tasks.
- [update on 10.02.2023] SimCore-pretrained models are now shared on Hugging Face Models.
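The shared checkpoints can be fetched with the huggingface_hub client. This is a minimal sketch: the repo_id and filename below are placeholders, not actual model names, so check the model cards for the real values.

# Hypothetical download of a SimCore-pretrained checkpoint from the Hugging Face Hub.
# repo_id and filename are placeholders; see the actual model card for real values.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="<USER>/<MODEL_NAME>", filename="last.pth")
print(ckpt_path)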
Requirements
Install the necessary packages with:
$ pip install -r requirements.txt
Data Preparation
We used 11 fine-grained datasets and 7 open-sets.
Place each dataset into data/[DATASET_NAME]/ (it should be structured in the torchvision.datasets.ImageFolder format).
To download and set up the data, please see the docs and run the Python scripts, if necessary.
$ cd data/
$ python [DATASET_NAME]_image_folder_generator.py
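As a sanity check, a correctly prepared dataset should load directly with torchvision. A minimal sketch, assuming an illustrative data/cub path (substitute your own [DATASET_NAME]):

# Sanity check: a prepared dataset should load as a standard ImageFolder.
# "data/cub" is an illustrative path; substitute your own [DATASET_NAME].
from torchvision import datasets, transforms

dataset = datasets.ImageFolder(root="data/cub", transform=transforms.ToTensor())
print(len(dataset), "images across", len(dataset.classes), "classes")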
Pretraining
To pretrain the model, simply run the shell script. (Multi-GPU training is supported; we used 4 GPUs.)
You will need to define the path to each dataset and to the retrieval model checkpoint.
# specify $TAG and $DATA
$ CUDA_VISIBLE_DEVICES=<GPU_ID> bash run_selfsup.sh
Here are some important arguments to consider (an illustrative invocation follows the list).
- --dataset1: fine-grained target dataset name
- --dataset2: open-set name (default: imagenet)
- --data_folder1: directory where the dataset1 is located
- --data_folder2: directory where the dataset2 is located
- --retrieval_ckpt: retrieval model checkpoint used before SimCore pretraining; for this, pretrain vanilla SSL for 1K epochs
- --model: model architecture (default: resnet50), see models
- --method: self-supervised learning method (default: simclr), see ssl
- --sampling_method: strategy for sampling from the open-set (choose between "random" or "simcore")
- --no_sampling: if sampling is unwanted (vanilla SSL pretraining), set this True
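For concreteness, here is how the key flags might compose. This is a sketch only: the entry-point script name (main.py here) and all paths are assumptions, not taken from the repository.

# Hypothetical flag combination (the entry-point name and all paths are assumptions):
$ python main.py \
    --dataset1 cub --data_folder1 ./data/cub/ \
    --dataset2 imagenet --data_folder2 ./data/imagenet/ \
    --retrieval_ckpt ./save/cub_vanilla_ssl/last.pth \
    --model resnet50 --method simclr \
    --sampling_method simcore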
The pretrained model checkpoints will be saved at save/[EXP_NAME]/. For example, if you run the default shell file, the last epoch checkpoint will be saved as save/$DATA_resnet50_pretrain_simclr_merge_imagenet_$TAG/last.pth.
Linear Evaluation
Linear evaluation of the pretrained models is implemented similarly to the pretraining.
Run the following shell script, additionally defining the pretrained model checkpoint.
# specify $TAG, $DATA, and --pretrained_ckpt
$ CUDA_VISIBLE_DEVICES=<GPU_ID> bash run_sup.sh
We also support kNN evaluation (--knn, --topk) and semi-supervised fine-tuning (--label_ratio, --e2e).
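For example, the evaluation variants might be invoked as follows. This is a hedged sketch: whether run_sup.sh forwards extra flags, and the example values, are assumptions.

# Hypothetical evaluation variants (flag forwarding and example values are assumptions):
# kNN evaluation of the frozen backbone:
$ CUDA_VISIBLE_DEVICES=0 bash run_sup.sh --knn --topk 20
# semi-supervised fine-tuning with a fraction of labels:
$ CUDA_VISIBLE_DEVICES=0 bash run_sup.sh --label_ratio 10 --e2e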
Results
With a stopping criterion, SimCore improves accuracy by +10.5% (averaged over 11 datasets), compared to pretraining without any open-set.
Try other open-sets
SimCore works with various, even uncurated, open-sets. You can also try your own custom or web-crawled open-sets.
Downstream Tasks
SimCore is extensively evaluated on various downstream tasks.
We thus provide the training and evaluation code for the following downstream tasks.
For more details, please see the docs and downstream/ directory.
- object detection
- pixel-wise segmentation
- open-set semi-supervised learning
- webly supervised learning
- semi-supervised learning
- active learning
- hard negative mining
Use the pretrained model checkpoint to run each downstream task.
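To reuse the weights, a minimal loading sketch is given below. The checkpoint's internal key layout (a "model" entry wrapping the state dict) is an assumption about the saved format, and the path uses example $DATA and $TAG values.

# Hypothetical checkpoint loading for a downstream task.
# The "model" key and the example path ($DATA=cub, $TAG=simcore) are assumptions.
import torch
from torchvision.models import resnet50

ckpt = torch.load("save/cub_resnet50_pretrain_simclr_merge_imagenet_simcore/last.pth",
                  map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # unwrap if the weights are nested
model = resnet50()
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", len(missing), "unexpected keys:", len(unexpected))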
BibTeX
If you find this repo useful for your research, please consider citing our paper:
@article{kim2023coreset,
title={Coreset Sampling from Open-Set for Fine-Grained Self-Supervised Learning},
author={Kim, Sungnyun and Bae, Sangmin and Yun, Se-Young},
journal={arXiv preprint arXiv:2303.11101},
year={2023}
}
Contact
- Sungnyun Kim: [email protected]
- Sangmin Bae: [email protected]