

FKD: A Fast Knowledge Distillation Framework for Visual Recognition

Official PyTorch implementation of the paper "A Fast Knowledge Distillation Framework for Visual Recognition" (ECCV 2022, arXiv:2112.01528), by Zhiqiang Shen and Eric Xing.

Abstract

Knowledge Distillation (KD) has been recognized as a useful tool in many visual tasks, such as supervised classification and self-supervised representation learning, but the main drawback of a vanilla KD framework lies in its mechanism: most of the computational overhead is consumed by forwarding through the giant teacher networks, which makes the whole learning procedure inefficient and costly. In this work, we propose a Fast Knowledge Distillation (FKD) framework that simulates the distillation training phase and generates soft labels following the multi-crop KD procedure, while enjoying a faster training speed than ReLabel since we have no post-processing such as RoI align and softmax operations. Our FKD is even more efficient than the conventional classification framework when employing multiple crops from the same image for data loading. We achieve 79.8% with ResNet-50 on ImageNet-1K, outperforming ReLabel by more than 1.0% while being faster. We also demonstrate the efficiency advantage of FKD on the self-supervised learning task.
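In other words, the teacher's predictions for many random crops are generated once and stored, and student training then only loads them. A training step looks conceptually like the minimal sketch below (our own illustration with hypothetical shapes, not the repository's actual code):

import torch
import torch.nn.functional as F

def fkd_train_step(model, optimizer, images, soft_labels):
    # images:      (batch_size * num_crops, 3, 224, 224) stacked image crops
    # soft_labels: (batch_size * num_crops, 1000) pre-generated teacher
    #              probabilities loaded from disk -- no teacher forward pass
    logits = model(images)
    # Cross-entropy against soft targets, written out explicitly
    loss = torch.sum(-soft_labels * F.log_softmax(logits, dim=1), dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()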

Citation

@article{shen2021afast,
      title={A Fast Knowledge Distillation Framework for Visual Recognition}, 
      author={Zhiqiang Shen and Eric Xing},
      year={2021},
      journal={arXiv preprint arXiv:2112.01528}
}

Supervised Training

Preparation

FKD Training on CNNs

To train a model, run train_FKD.py with the desired model architecture and the paths to the soft labels and the ImageNet dataset:

python train_FKD.py -a resnet50 --lr 0.1 --num_crops 4 -b 1024 --cos --softlabel_path [soft label path] [imagenet-folder with train and val folders]

For --softlabel_path, use a directory path of the form ./FKD_soft_label_500_crops_marginal_smoothing_k_5.
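The directory name reflects how the labels were generated: 500 region crops per image, quantized with the paper's marginal smoothing Top-K (here K = 5) strategy, i.e., only the teacher's five largest probabilities are stored and the remaining probability mass is smoothed uniformly over the other classes. Below is a minimal sketch of rebuilding a full soft-label vector from such an entry (the storage layout and function here are illustrative, not the repository's actual format):

import torch

def rebuild_soft_label(topk_values, topk_indices, num_classes=1000):
    # topk_values:  (K,) stored teacher probabilities of the Top-K classes
    # topk_indices: (K,) the corresponding class indices
    k = topk_values.numel()
    # Spread the remaining probability mass uniformly over the other classes.
    residual = (1.0 - topk_values.sum().item()) / (num_classes - k)
    label = torch.full((num_classes,), residual)
    label[topk_indices] = topk_values
    return label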

Multi-processing distributed training on single node with multiple GPUs:

python train_FKD.py \
--dist-url 'tcp://127.0.0.1:10001' \
--dist-backend 'nccl' \
--multiprocessing-distributed --world-size 1 --rank 0 \
-a resnet50 --lr 0.1 --num_crops 4 -b 1024 --cos -j 32 \
--save_checkpoint_path ./FKD_nc_4_res50_plain \
--softlabel_path [soft label path] \
[imagenet-folder with train and val folders]

For multi-node, multi-processing distributed training, please refer to the official PyTorch ImageNet training code for details.
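For example, a two-node run would follow the same pattern as the single-node command above (a sketch based on the official PyTorch ImageNet example's arguments; the master IP address and port below are placeholders):

# On node 0 (master node, IP 192.168.0.1):
python train_FKD.py --dist-url 'tcp://192.168.0.1:10001' --dist-backend 'nccl' \
--multiprocessing-distributed --world-size 2 --rank 0 [remaining arguments as above]

# On node 1:
python train_FKD.py --dist-url 'tcp://192.168.0.1:10001' --dist-backend 'nccl' \
--multiprocessing-distributed --world-size 2 --rank 1 [remaining arguments as above]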

Evaluation

python train_FKD.py -a resnet50 -e --resume [model path] [imagenet-folder with train and val folders]

Training Speed Comparison

The training speed of each epoch is measured on the CIAI cluster at MBZUAI with 8 NVIDIA V100 GPUs. The batch size is 1024 for all three methods: (i) the regular/vanilla classification framework, (ii) ReLabel, and (iii) FKD. For Vanilla and ReLabel, we average over 10 epochs after the speed has stabilized. For FKD, we use num_crops = 4 and average over (4 × 10) epochs; note that num_crops = 8 gives an even faster training speed. All other settings are identical across the comparison.

Method      Network    Training time per epoch
Vanilla     ResNet-50  579.36 sec
ReLabel     ResNet-50  762.11 sec
FKD (Ours)  ResNet-50  486.77 sec
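For reference, per-epoch averages like those above can be measured with a simple harness such as the sketch below (our own illustration, not the repository's benchmarking code; train_one_epoch stands in for one full pass of any of the three methods):

import time

def average_epoch_time(train_one_epoch, warmup_epochs=5, measured_epochs=10):
    # Warm up first so data caching and cuDNN autotuning do not skew results,
    # then average the wall-clock time of the stable epochs.
    for _ in range(warmup_epochs):
        train_one_epoch()
    times = []
    for _ in range(measured_epochs):
        start = time.time()
        train_one_epoch()
        times.append(time.time() - start)
    return sum(times) / len(times)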

Trained Models

Method       Network     Top-1 accuracy  Weights  Configurations
ReLabel      ResNet-50   78.9            --       --
FKD          ResNet-50   80.1 (+1.2%)    link     same as ReLabel, with initial lr = 0.1 × batch size / 512
FKD (Plain)  ResNet-50   79.8            link     Table 12 in paper (w/o warmup & color jitter)
FKD (AdamW)  ResNet-50   80.2            link     Table 13 in paper (same as our settings on ViT and SReT)
ReLabel      ResNet-101  80.7            --       --
FKD          ResNet-101  81.9 (+1.2%)    link     Table 12 in paper
FKD (Plain)  ResNet-101  81.7            link     Table 12 in paper (w/o warmup & color jitter)
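The linear scaling rule in the FKD ResNet-50 row means, for example, that the batch size of 1024 used in the commands above corresponds to an initial lr of 0.1 × 1024 / 512 = 0.2:

def scaled_lr(batch_size, base_lr=0.1, base_batch_size=512):
    # Linear learning-rate scaling rule from the table above.
    return base_lr * batch_size / base_batch_size

print(scaled_lr(1024))  # 0.2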

Mobile-level Efficient Networks

Method          Network            FLOPs  Top-1 accuracy   Weights
FBNet           FBNet-c100         375M   75.12%           --
FKD             FBNet-c100         375M   77.13% (+2.01%)  link
EfficientNetV2  EfficientNetV2-B0  700M   78.35%           --
FKD             EfficientNetV2-B0  700M   79.94% (+1.59%)  link

The training protocol is the same as the one we used for ViT/SReT:

# Use the same settings as on ViT and SReT
cd train_ViT
# Train the model
python -u train_ViT_FKD.py \
--dist-url 'tcp://127.0.0.1:10001' \
--dist-backend 'nccl' \
--multiprocessing-distributed --world-size 1 --rank 0 \
-a tf_efficientnetv2_b0 \
--lr 0.002 --wd 0.05 \
--epochs 300 --cos \
--save_checkpoint_path ./FKD_nc_4_224_efficientnetv2_b0 \
-j 32 --num_classes 1000 \
--soft_label_type marginal_smoothing_k5  \
-b 1024 --num_crops 4 \
--softlabel_path [soft label path] \
[imagenet-folder with train and val folders]

FKD Training on ViT/DeiT and SReT

To train a ViT model, run train_ViT_FKD.py with the desired model architecture and the paths to the soft labels and the ImageNet dataset:

cd train_ViT
python train_ViT_FKD.py \
--dist-url 'tcp://127.0.0.1:10001' \
--dist-backend 'nccl' \
--multiprocessing-distributed --world-size 1 --rank 0 \
-a SReT_LT --lr 0.002 --wd 0.05 --num_crops 4 -b 1024 --cos \
--softlabel_path [soft label path] \
[imagenet-folder with train and val folders]

For instructions on the SReT_LT model, please refer to SReT for details.

Evaluation

python train_ViT_FKD.py -a SReT_LT -e --resume [model path] [imagenet-folder with train and val folders]

Trained Models

Model            FLOPs  #Params  Top-1 accuracy  Weights  Configurations
DeiT-T-distill   1.3B   5.7M     74.5            --       --
FKD ViT/DeiT-T   1.3B   5.7M     75.2            link     Table 13 in paper
SReT-LT-distill  1.2B   5.0M     77.7            --       --
FKD SReT-LT      1.2B   5.0M     78.7            link     Table 13 in paper

Fast MEAL V2

Please see MEAL V2 for instructions on running FKD with MEAL V2.

Self-supervised Representation Learning Using FKD

Please see FKD-SSL for instructions on running FKD for the SSL task.

Contact

Zhiqiang Shen (zhiqiangshen0214 at gmail.com or zhiqians at andrew.cmu.edu)