ReviewKD
Reproduced ImageNet result with torchdistill and questions about your baselines
Thank you for open-sourcing your project!
Reproduced ImageNet result
I reimplemented the knowledge review method in a model-agnostic way with torchdistill. Using the reimplemented method, I successfully reproduced the ImageNet result for a pair of ResNet-18 (student, 71.64% accuracy) and ResNet-34 (teacher), as shown here. I hope this helps if you study knowledge distillation further.
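In case it helps other readers, the core of what I reimplemented is the hierarchical context loss (HCL) between the fused student features and the corresponding teacher features (the ABF fusion modules are omitted here). Below is a minimal PyTorch sketch; the pooling levels and level weights follow my reading of the paper, so treat them as assumptions rather than the official implementation.

```python
import torch.nn.functional as F

def hcl_loss(student_feats, teacher_feats):
    """Hierarchical context loss (HCL) sketch: an L2 loss computed on the
    full feature map and on several pooled versions (4x4, 2x2, 1x1),
    combined with decreasing weights. `student_feats` are the fused student
    features and `teacher_feats` the matching teacher features, one pair
    per network stage."""
    total = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        _, _, h, _ = fs.shape
        loss = F.mse_loss(fs, ft)          # full-resolution term
        weight, weight_sum = 1.0, 1.0
        for level in (4, 2, 1):            # pooled context terms
            if level >= h:
                continue
            weight /= 2.0
            ps = F.adaptive_avg_pool2d(fs, (level, level))
            pt = F.adaptive_avg_pool2d(ft, (level, level))
            loss = loss + weight * F.mse_loss(ps, pt)
            weight_sum += weight
        total = total + loss / weight_sum
    return total
```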
Implementation of baselines for object detection
In your paper (Table 4), Hinton et al.'s knowledge distillation (KD) and the FitNet method are used as baselines for object detection. The KD method and the 2nd stage of the FitNet method train the student model end-to-end with the teacher's final output. I was wondering how you implemented these methods for object detection models. Could you please publish that code as well?
From my understanding (as discussed here), such methods cannot be directly applied to object detection models like R-CNNs, since they return a different number of bounding boxes and class probabilities depending on both 1) the input image and 2) the learnt model parameters. Thus, the shapes of the teacher model's outputs may not match those of the student model. Even when the shapes do match, e.g., after several epochs of training, the order of the teacher's predicted objects in the input image may not be aligned with that of the student's predicted objects.
Hi, thank you for your interest in our work. For KD, we apply it to the classification branch. The misalignment between the teacher's and the student's proposals does exist; we fix this problem by feeding the student's proposals to both the teacher and the student model. For FitNet, we apply it to the inner features (the second stage's output of the FPN) instead of the final output. The FPN module always has the same dimensions, and we can always transform the student's features to the same size as the teacher's. We implemented these baselines on an old version of detectron2 that is not compatible with the latest version. We will consider porting the code and releasing it.
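In loss terms, this amounts to a standard Hinton-style KD term on the per-RoI classification logits, where both models score the same (student) proposals so the logits line up. A minimal sketch follows; the temperature value and the detector calls in the comments are placeholders, not the exact setting used in the paper.

```python
import torch.nn.functional as F

def kd_cls_loss(student_logits, teacher_logits, temperature=4.0):
    """Hinton-style KD on per-RoI classification logits.

    Because both logit tensors are computed on the SAME set of proposals
    (the student's), they have identical shape (num_rois, num_classes)
    and the i-th row of each refers to the same box.
    The temperature is an assumed value, not taken from the paper."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2

# Hypothetical usage with a two-stage detector (names are placeholders):
#   proposals = student_rpn(images)                           # student's proposals
#   s_logits  = student_roi_head(student_feats, proposals)
#   t_logits  = teacher_roi_head(teacher_feats, proposals)    # teacher scores the same boxes
#   loss_kd   = kd_cls_loss(s_logits, t_logits)
```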
Hi @akuxcw, thank you for your response.
For KD, we apply it to the classification branch. The misalignment between the teacher's and the student's proposals does exist; we fix this problem by feeding the student's proposals to both the teacher and the student model.
Does this mean you obtain the teacher's output from its classification branch via input -> student backbone -> student's proposals -> teacher's classification branch?
For FitNet, we apply it to the inner features (the second stage's output of the FPN) instead of the final output. The FPN module always has the same dimensions.
This is about the 1st stage of the FitNet method (which trains only the first layers of the student model), right? What I meant by the 2nd stage of the FitNet method is training the entire student model, after the 1st stage, potentially with the teacher's final output. Now I assume you applied your KD technique to FitNet's 2nd-stage training as well.
The process should be:

input -> student backbone -> student's proposal
                                     |
                                     v
input -> teacher backbone -> ROI Align -> teacher's output
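In code terms, I imagine the ROI Align step above would look roughly like the sketch below, using torchvision.ops.roi_align. The feature-map stride and pooled size are made-up values, and feeding the pooled features through the teacher's box head to get the final output is left out.

```python
from torchvision.ops import roi_align

def teacher_rois_on_student_proposals(teacher_feat, student_boxes,
                                      stride=16, pool_size=7):
    """Pool the TEACHER's feature map at the STUDENT's proposal boxes.

    teacher_feat:  (N, C, H, W) feature map from the teacher backbone/FPN level
    student_boxes: list of (num_boxes_i, 4) tensors in image coordinates,
                   one per image, produced by the student's RPN
    stride / pool_size are illustrative values, not the paper's settings."""
    rois = roi_align(
        teacher_feat,
        student_boxes,
        output_size=(pool_size, pool_size),
        spatial_scale=1.0 / stride,   # map image coordinates to feature-map coordinates
        aligned=True,
    )
    # rois: (total_num_boxes, C, pool_size, pool_size); passing this through the
    # teacher's box head / classifier would give the "teacher's output" above.
    return rois
```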
We didn't follow the original two-stage training process of FitNet. We just use FitNet's distillation loss during the training of the student, just like KD or our own method. The setting follows this repo.
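In other words, the FitNet baseline here is a single training run with an extra feature-matching (hint) loss added to the usual detection losses, rather than FitNet's separate stage-1/stage-2 schedule. A simplified sketch of that kind of hint loss is below; the 1x1-conv connector, the resizing step, and the loss weight are placeholders, not the exact setting used.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """One-stage FitNet-style hint loss: L2 between the teacher's feature map
    and the student's feature map after a learned 1x1-conv connector.
    Channel numbers and the loss weight are placeholders."""

    def __init__(self, student_channels, teacher_channels, weight=1.0):
        super().__init__()
        self.connector = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
        self.weight = weight

    def forward(self, student_feat, teacher_feat):
        aligned = self.connector(student_feat)
        # If the spatial sizes differ, resize the student feature to the teacher's.
        if aligned.shape[-2:] != teacher_feat.shape[-2:]:
            aligned = F.interpolate(aligned, size=teacher_feat.shape[-2:],
                                    mode="bilinear", align_corners=False)
        return self.weight * F.mse_loss(aligned, teacher_feat.detach())

# This auxiliary loss is simply added to the normal training losses in one run,
# i.e. there is no separate FitNet stage-1 pretraining followed by stage-2 KD.
```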
Thank you for the clarification. Was there any particular reason for that choice of FitNet implementation? I know some other papers, such as "Contrastive Representation Distillation" (the repo you mentioned), used FitNet in that way, but I couldn't find a reasonable justification for doing so. Even though the CRD paper says its implementation is based on the original FitNet paper, it doesn't follow the original two-stage training scheme.
I'm not sure why previous works use FitNet this way. I guess the main reason is that one-stage training is easier to implement, and the results of one-stage training are very likely to be better than those of the original two-stage training scheme. For me, the convenience of one-stage training is the most important reason :)
Thank you @akuxcw for the response. Then I'd suggest citing the CSE+L2 method (called L2 in "Prime-Aware Adaptive Distillation") instead of FitNet, as that paper also compares the performance of FitNet with that of the (one-stage) L2-based method. They actually showed that the L2 method outperforms FitNet on the CIFAR-100 and ImageNet datasets, and I reproduced the result of the L2 method on the ImageNet dataset.