
A PyTorch-based library to evaluate learning methods on small image classification datasets

:gem: GEM: Generalization-Efficient Methods for image classification with small datasets

GEM is a PyTorch-based library with the goal of providing a shared codebase for fast prototyping, training and reproducible evaluation of learning algorithms that generalize on small image datasets.

In particular, the repository contains all the tools to reproduce and possibly extend the experiments of the paper Image Classification with Small Datasets: Overview and Benchmark. It provides:

  • [x] An extendable benchmark of 5 datasets spanning various problem domains and data types
  • [x] A realistic and fair experimental pipeline, including hyper-parameter optimization and common training setups
  • [x] An extendable pool of implementations of state-of-the-art methods

Given the "living" nature of our libary, we plan in the future to introduce and keep the repository updated with new approaches and datasets to drive further progress toward small-sample learning methods.

:bookmark_tabs: Table of Contents

  • Overview
    • Structure
    • Datasets
    • Methods
  • Usage
    • Installation
    • Method Evaluation
    • Library Extension
  • Results
  • Citation

:book: Overview

Structure

More details soon!

Datasets

The datasets constituting our benchmark are the following:

| Dataset | Classes | Imgs/Class | Trainval | Test | Problem Domain | Data Type | Identifier |
|---------|--------:|-----------:|---------:|-----:|----------------|-----------|------------|
| ciFAIR-10* | 10 | 50 | 500 | 10,000 | Natural Images | RGB (32x32) | cifair10 |
| CUB | 200 | 30 | 5,994 | 5,794 | Fine-Grained | RGB | cub |
| ISIC 2018* | 7 | 80 | 560 | 1,944 | Medical | RGB | isic2018 |
| EuroSAT* | 10 | 50 | 500 | 19,500 | Remote Sensing | Multispectral | eurosat |
| CLaMM* | 12 | 50 | 600 | 2,000 | Handwriting | Grayscale | clamm |

* We use subsampled versions of the original datasets with fewer images per class.

For additional details on dataset statistics, splits, and how to download the data, visit the respective page in the datasets folder. The directory contains one sub-directory for each dataset in our benchmark. These directories contain the split files specifying the subsets of data employed in our experiments. The files trainval{i}.txt are simply the concatenation of train{i}.txt and val{i}.txt (with i in {0,1,2}). These subsets can be used for the final training before evaluating a method on the test set. Development and hyper-parameter optimization (HPO), however, should only be conducted using the training and validation sets.

The aforementioned files list all images contained in the respective subset, one per line, along with their class labels. Each line contains the filename of an image followed by a space and the numeric index of its label.

The only exception to this common format is ciFAIR-10, since its images do not have individual filenames. A description of its splits can be found in the README of the respective directory.
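For illustration, here is a minimal Python sketch that parses such a split file into (filename, label) pairs. The path below is only an example and assumes the split files shipped in the datasets folder; adapt it to your local layout.

```python
# Minimal sketch: read a split file where each line is "<image filename> <numeric label>".
from pathlib import Path

def read_split(split_file):
    samples = []
    for line in Path(split_file).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        filename, label = line.rsplit(" ", 1)  # split on the last space
        samples.append((filename, int(label)))
    return samples

# Example path (assumed local layout): one of the CUB split files in datasets/.
samples = read_split("datasets/cub/train0.txt")
print(len(samples), samples[:3])
```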

Methods

We currently provide the implementations of the following methods:

| Method | Original code | Our implementation | Identifier |
|--------|---------------|--------------------|------------|
| Cross-Entropy Loss (baseline) | -- | xent.py | xent |
| Deep Hybrid Networks | link | scattering.py | scattering |
| OLÉ | link | ole.py | ole |
| Grad-L2 Penalty | link | kernelregular.py | gradl2 |
| Cosine Loss (+ Cross-Entropy) | -- | cosineloss.py | cosine |
| Harmonic Networks | link | harmonic.py | harmonic |
| Full Convolution | link | fconv.py | fconv |
| DSK Networks | -- | dsk_classifier.py | dsk |
| Distilling Visual Priors | link | distill_pretraining.py, distill_classifier.py | dvp-pretrain, dvp-distill |
| Auxiliary Learning | link | auxilearn.py | auxilearn |
| T-vMF Similarity | link | tvmf.py | tvmf |

:gear: Usage

Installation

To use the repository, clone it to your local system:

git clone https://github.com/lorenzobrigato/gem.git

and install the required packages with:

python -m pip install -r requirements.txt

Note: GEM requires PyTorch with GPU support. For instructions on how to install a PyTorch version compatible with your CUDA version, see pytorch.org.
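Before launching any experiment, you may want to verify that your PyTorch installation actually sees the GPU. A generic check (not part of GEM):

```python
# Quick sanity check for GPU-enabled PyTorch.
import torch

print(torch.cuda.is_available())           # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # name of the first CUDA device
```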

Method Evaluation

We provide a set of scripts located in the directories scripts and bash_scripts to reproduce the experimental pipeline presented in our paper. In particular, evaluating one method on the full benchmark consists of:

  1. Finding hyper-parameters by training the approach on the train{i}.txt split while evaluating on the respective val{i}.txt
  2. Training 10 instances of the method with the found configuration on the full trainval{i}.txt split while evaluating on the test split
  3. Repeating steps 1 and 2 independently for all values of i

For all datasets, the number of training splits used in our paper is 3, hence i is in the range {0,1,2}. For the test sets, some datasets have multiple splits (as for training), while for others we employed a single test0.txt split. We performed multiple independent evaluations, varying dataset splits and optimization runs, to account for random variance, which is particularly significant in the small-sample regime.

To perform steps 1 and 2 separately, we provide hpo.py and train.py / train_ray.py, respectively. It is also possible to run steps 1 and 2 sequentially by executing full_train.py. For step 3, refer to the bash scripts available in bash_scripts. The following sections describe the available scripts in more detail.

Hyper-Parameter Optimization (HPO)

For HPO, we employ an efficient and easy-to-use library (Tune) and a state-of-the-art search algorithm, the Asynchronous Successive Halving Algorithm (ASHA).

The script hpo.py is dedicated to finding the hyper-parameters of a method. For instance, you can search for the default hyper-parameters (learning rate, weight decay, and batch size) of the cross-entropy baseline with a Wide ResNet-16-8 on the ciFAIR-10 dataset and split 0 (the default) by running:

python scripts/hpo.py cifair10 \
--method xent \
--architecture wrn-16-8 \
--rand-shift 4 \
--epochs 500 \
--grace-period 50 \
--num-trials 250 \
--cpus-per-trial 8 \
--gpus-per-trial 0.5

After completion, the script prints the found hyper-parameters on screen. Note that --grace-period and --num-trials are parameters of the search algorithm that have been fixed for each dataset and are hard-coded in the bash scripts of the bash_scripts folder. For a complete view of all the arguments accepted by the script, check the help message of the parser by running:

python scripts/hpo.py -h

Note also that you can configure the hardware resources allocated to trials. For example, with --gpus-per-trial 0.5 the script will run two trials in parallel on each GPU. Exploit parallelism to speed up the search, but keep in mind that the number of trials per GPU is bounded by the available GPU memory.
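For intuition, the following self-contained sketch shows how the main arguments above map onto a Ray Tune search with an ASHA scheduler. It is not the code of hpo.py: the training function is a placeholder and the search space is only illustrative.

```python
# Illustrative Ray Tune + ASHA sketch (Tune 1.x-style API), not the actual hpo.py.
import random
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def train_fn(config):
    # Placeholder trainable: pretend each epoch yields a validation balanced accuracy.
    acc = 0.0
    for epoch in range(500):                          # --epochs
        acc = min(1.0, acc + random.random() * config["lr"])
        tune.report(balanced_accuracy=acc)            # ASHA stops unpromising trials early

search_space = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "weight_decay": tune.loguniform(1e-5, 1e-2),
    "batch_size": tune.choice([10, 25, 50]),
}

analysis = tune.run(
    train_fn,
    config=search_space,
    num_samples=250,                                  # --num-trials
    scheduler=ASHAScheduler(
        metric="balanced_accuracy",
        mode="max",
        max_t=500,                                    # --epochs
        grace_period=50,                              # --grace-period
    ),
    resources_per_trial={"cpu": 8, "gpu": 0.5},       # --cpus-per-trial / --gpus-per-trial
)
print(analysis.get_best_config(metric="balanced_accuracy", mode="max"))
```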

Final Evaluation

Once the hyper-parameters have been found, you can train a single model for the test evaluation with the script train.py. Alternatively, you can train multiple instances of the same model in parallel, again exploiting the Tune library, with the script train_ray.py.

An example of training 10 instances of the baseline method with previously found hyper-parameters (learning rate, weight decay, and batch size) is:

python scripts/train_ray.py cifair10 \
--method xent \
--architecture wrn-16-8 \
--rand-shift 4 \
--epochs 500 \
--lr 4.55e-3 \
--weight-decay 5.29e-3 \
--batch-size 10 \
--num-trials 10 \
--cpus-per-trial 8 \
--gpus-per-trial 0.5 \
--eval-interval 10 \
--save /home/user/gem_models/cifair10/cifair10_xent.pth \
--history /home/user/gem_logs/cifair10/cifair10_xent.json

Note that we save the model file and the history log containing the results by specifying the --save and --history arguments.

Full Training

The HPO and final evaluation steps can be executed sequentially from the same script, full_train.py. Most of the arguments are shared with the previous scripts. A key difference is the suffix "-f" that is appended to some arguments to distinguish the two training phases. E.g., given --num-trials 250 and --num-trials-f 10, the script runs 250 trials for hyper-parameter optimization and 10 trials for the final evaluation. For additional details, refer to the help message of the parser:

python scripts/full_train.py -h

Multi-Split Training

To obtain a complete evaluation on one of the datasets of our benchmark, it is necessary to repeat the full training on the 3 splits. This can be achieved by running one of the bash scripts in bash_scripts. Each of those scripts runs full_train.py sequentially on all splits.

Note: the default configurations for dataset-specific augmentations and the parameters of the search algorithm are hard-coded inside the scripts. Any additional argument needed for the full training can be added to the call of the bash script. An example for the baseline training on ciFAIR-10 is:

bash bash_scripts/bench_cifair10.sh \
--method xent \
--cpus-per-trial 8 \
--gpus-per-trial 0.5 \
--eval-interval-f 10 \
--save-f /home/user/gem_models/cifair10/cifair10_xent.pth \
--history-f /home/user/gem_logs/cifair10/cifair10_xent.json

Since multiple models/logs are saved, full_train.py also inserts a unique timestamp right before the file extension of the names provided via --save-f / --history-f.
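Because of the timestamp, each run produces a distinct file. A small sketch to list the history logs written for the example above (paths as in the command; adapt to your own):

```python
# List the timestamped history logs produced by full_train.py for the run above.
import glob

for path in sorted(glob.glob("/home/user/gem_logs/cifair10/cifair10_xent*.json")):
    print(path)
```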

Evaluation

The script evaluate_baccuracy_json.py computes the balanced accuracy of single runs and the mean/standard deviation over multiple runs. It can also save a summary of these results in a more compact format (JSON). For more info, execute:

python scripts/evaluate_baccuracy_json.py -h
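For reference, the reported metric is the balanced accuracy, i.e., the mean of per-class recalls. A minimal illustration with scikit-learn (this only illustrates the metric, not the script above):

```python
# Balanced accuracy = average of per-class recalls, robust to unbalanced test sets.
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
# Per-class recalls: 2/3, 2/2, 1/1 -> mean ≈ 0.889
print(balanced_accuracy_score(y_true, y_pred))
```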

Library Extension

More details soon!

:bar_chart: Results

Here are the full results for all methods currently evaluated on our benchmark:

| Method | ciFAIR-10 | CUB | ISIC 2018 | EuroSAT | CLaMM | Avg. |
|--------|----------:|----:|----------:|--------:|------:|-----:|
| Cross-Entropy Loss (baseline) | 55.18% | 70.79% | 64.49% | 90.58% | 70.15% | 70.24% |
| Deep Hybrid Networks | 53.84% | 55.37% | 62.06% | 88.77% | 63.75% | 64.76% |
| OLÉ | 55.19% | 66.55% | 62.80% | 90.29% | 74.28% | 69.82% |
| Grad-L2 Penalty | 51.90% | 51.94% | 60.21% | 81.50% | 65.10% | 62.13% |
| Cosine Loss | 52.39% | 66.94% | 62.42% | 88.53% | 68.89% | 67.83% |
| Cosine Loss + Cross-Entropy | 52.77% | 70.43% | 63.17% | 89.65% | 70.64% | 69.33% |
| Harmonic Networks | **58.00%** | **73.07%** | **69.70%** | **91.98%** | **77.25%** | **74.00%** |
| Full Convolution | 54.64% | 63.74% | 57.34% | 89.47% | 69.06% | 66.85% |
| DSK Networks | 53.84% | 69.75% | 63.41% | 91.09% | 65.43% | 68.70% |
| Distilling Visual Priors | 57.80% | 70.81% | 62.39% | 88.96% | 69.07% | 69.81% |
| Auxiliary Learning | 51.84% | 43.57% | 61.70% | 80.92% | 60.24% | 59.65% |
| T-vMF Similarity | 56.75% | 68.19% | 64.60% | 88.50% | 69.33% | 69.47% |

All values represent the balanced classification accuracy averaged over 30 training runs, i.e., three groups of 10 runs over the three dataset splits.
Bold results are the best in their column, and italic results are not significantly worse than the best (at a significance level of 5%).

CUB, ISIC, and CLaMM have unbalanced test sets. For the other datasets, balanced classification accuracy is equivalent to standard accuracy.

:writing_hand: Citation

If you find this repository useful for your research, please consider citing our paper:

@ARTICLE{9770050,
  author={Brigato, Lorenzo and Barz, Björn and Iocchi, Luca and Denzler, Joachim},
  journal={IEEE Access},
  title={Image Classification With Small Datasets: Overview and Benchmark},
  year={2022},
  volume={10},
  pages={49233-49250},
  doi={10.1109/ACCESS.2022.3172939}
}