
Knowledge Distillation

Open · VarunGumma opened this issue 2 years ago · 10 comments

Is there any implementation of Knowledge Distillation in Fairseq? I need to distill a large multilingual Transformer model and I cannot find any suitable implementation for it here.

VarunGumma · Sep 20 '22 14:09

@VarunGumma I am also trying to implement KD with the modification of fairseq

HeegonJin · Sep 26 '22 07:09

@HeegonJin I have a basic implementation of KD in my repo here. It is based on the implementation of https://github.com/LeslieOverfitting/selective_distillation. They use a much older version of fairseq, and I have tried to integrate their changes into the latest version.

EDIT: My repository is very dynamic, and I try to include the new and relevant techniques in KD as much as possible. You may occasionally see broken stuff, so please raise issues for that or open PRs. The README of my repository should contain enough documentation on how to run KD training in fairseq, but if you have any more queries, please feel free to reach out to me.
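
For reference, the usual word-level KD objective mixes the standard cross-entropy on the gold references with a temperature-scaled KL term against the teacher's output distribution. Below is a minimal, self-contained PyTorch sketch; the names (temperature, alpha_kd) mirror the options mentioned later in this thread, but the actual criterion in the repo may differ in detail.

```python
# Minimal sketch of word-level knowledge distillation (illustrative, not the repo's exact code).
# student_logits, teacher_logits: (batch, seq_len, vocab); target: (batch, seq_len)
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, target, pad_idx, temperature=2.5, alpha_kd=5.0):
    # Standard cross-entropy on the gold references.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        target.view(-1),
        ignore_index=pad_idx,
    )

    # KL divergence between temperature-softened teacher and student distributions.
    T = temperature
    student_lprobs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    kl = F.kl_div(student_lprobs, teacher_probs, reduction="none").sum(-1)

    # Mask out padding positions before averaging over real tokens.
    pad_mask = target.eq(pad_idx)
    kl = kl.masked_fill(pad_mask, 0.0).sum() / (~pad_mask).sum()

    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return ce + alpha_kd * (T ** 2) * kl
```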

VarunGumma · Sep 29 '22 17:09

@VarunGumma Nice work! It might be even better if it also supported attention-based distillation, as in TinyBERT and MiniLM. I will try those out. Thanks.
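
For concreteness, TinyBERT-style attention transfer comes down to matching the student's per-layer attention score matrices against selected teacher layers. A rough sketch, assuming both models expose attention maps of shape (batch, heads, seq, seq), use the same number of heads, and the student has no more layers than the teacher:

```python
# Rough sketch of TinyBERT-style attention distillation (not part of the repo above).
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attns, teacher_attns):
    # Map each student layer to a teacher layer with the "uniform" strategy:
    # student layer i mimics teacher layer (i + 1) * stride - 1.
    s_layers, t_layers = len(student_attns), len(teacher_attns)
    stride = t_layers // s_layers
    loss = 0.0
    for i, s_attn in enumerate(student_attns):
        t_attn = teacher_attns[(i + 1) * stride - 1]
        # MSE between the (batch, heads, seq, seq) attention matrices.
        loss = loss + F.mse_loss(s_attn, t_attn)
    return loss / s_layers
```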

HeegonJin · Oct 04 '22 01:10

@VarunGumma Hello, I tried to run your code after "pip install --editable ./", but it fails with: "fairseq-train: error: unrecognized arguments: --distillation-strategy batch_level --distillation-rate 0.5 --temperature 2.5 --temperature-schedule none --alpha-kd 5"

HeegonJin · Oct 04 '22 03:10

Please use the latest version of my code; you can find an example of knowledge_distillation_translation in the examples folder. As this work is in progress, I make frequent bug fixes and changes, so please stay up to date with the code.
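
For anyone hitting the same "unrecognized arguments" error: those flags only exist once the custom KD criterion is installed, since recent fairseq versions declare extra command-line options on the criterion's dataclass. A hypothetical sketch of how such flags are typically wired up (the names below are illustrative, not the exact code from the repo):

```python
# Hypothetical sketch: flags like --temperature / --alpha-kd come from a custom
# criterion's dataclass, so an install without this criterion rejects them.
from dataclasses import dataclass, field

from fairseq.criterions import FairseqCriterion, register_criterion
from fairseq.dataclass import FairseqDataclass


@dataclass
class KDCriterionConfig(FairseqDataclass):
    temperature: float = field(
        default=1.0, metadata={"help": "softmax temperature for distillation"}
    )
    alpha_kd: float = field(
        default=1.0, metadata={"help": "weight of the KD term in the total loss"}
    )


@register_criterion("knowledge_distillation", dataclass=KDCriterionConfig)
class KDCriterion(FairseqCriterion):
    def __init__(self, task, temperature, alpha_kd):
        super().__init__(task)
        self.temperature = temperature
        self.alpha_kd = alpha_kd

    # forward() would combine cross-entropy with the KD term sketched earlier.
```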

VarunGumma · Oct 04 '22 03:10

@VarunGumma Could you please give a few more details about the dataset you used and the option "--user-dir $custom_model_dir"?

HeegonJin · Oct 04 '22 14:10

@HeegonJin I use a custom model architecture that I defined in a file in that directory ($custom_model_dir). If you are using models (teacher and student) that are already defined in the fairseq library, you won't need that parameter.
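
For illustration, a --user-dir package usually just registers the extra architecture(s) in its __init__.py, roughly like this (the architecture name and sizes below are made up, not the exact ones from the repo):

```python
# Hypothetical contents of $custom_model_dir/__init__.py, loaded via --user-dir,
# registering a wider Transformer variant that fairseq-train can select with --arch.
from fairseq.models import register_model_architecture
from fairseq.models.transformer import base_architecture


@register_model_architecture("transformer", "transformer_4x")  # illustrative name
def transformer_4x(args):
    # A wider Transformer; the real repo's hyperparameters may differ.
    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 1024)
    args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 8192)
    args.decoder_embed_dim = getattr(args, "decoder_embed_dim", 1024)
    args.decoder_ffn_embed_dim = getattr(args, "decoder_ffn_embed_dim", 8192)
    base_architecture(args)
```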

VarunGumma · Oct 04 '22 17:10

> @HeegonJin I have a basic implementation of KD in my repo here. It is based on the implementation of https://github.com/LeslieOverfitting/selective_distillation. They use a much older version of fairseq, and I have tried to integrate their changes into the newer version. Here is a sample bash script to distill a transformer-4x model to a smaller transformer-base model.

Thanks for the detail! By the way, your "sample bash script" links to a random paper.

HeegonJin · Oct 04 '22 23:10

> Thanks for the detail! By the way, your "sample bash script" links to a random paper.

I have fixed the issue.

VarunGumma · Oct 05 '22 07:10

@VarunGumma Hello, if I just want to do the simplest form of knowledge distillation, can I do this: train a standard Transformer model on the original data, translate the training set with it to obtain distilled data, and finally use the distilled data to train the student model. Does this count as simple knowledge distillation?

kkeleve · Oct 18 '22 09:10

Yes, what you described is hard-label distillation @kkeleve
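
For readers landing here, that recipe is sequence-level knowledge distillation (Kim & Rush, 2016). A rough sketch using fairseq's hub interface; paths, file names, and preprocessing (BPE, etc.) are placeholders, and in practice running fairseq-generate on the binarized training data is faster:

```python
# Sketch of the "hard label" (sequence-level) distillation pipeline described above.
from fairseq.models.transformer import TransformerModel

# 1. Load the teacher trained on the original parallel data.
teacher = TransformerModel.from_pretrained(
    "checkpoints_teacher",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin",
)

# 2. Re-translate the training-set source side with the teacher.
with open("train.src") as f:
    sources = [line.strip() for line in f]
distilled_targets = teacher.translate(sources, beam=5)

# 3. Write the distilled targets out; the student is then trained on the
#    (train.src, train.distilled.tgt) pair with the usual
#    fairseq-preprocess / fairseq-train recipe.
with open("train.distilled.tgt", "w") as f:
    f.write("\n".join(distilled_targets) + "\n")
```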

robotsp · Feb 20 '23 07:02