Knowledge Distillation
Is there any implementation of Knowledge Distillation in Fairseq? I need to distill a large multilingual Transformer model and I am not finding any suitable implementation for it here.
@VarunGumma I am also trying to implement KD by modifying fairseq.
@HeegonJin I have a basic implementation of KD in my repo here. It is based on the implementation of https://github.com/LeslieOverfitting/selective_distillation. They use a much older version of fairseq, and I have tried to integrate their changes into the latest version.
EDIT: My repository is very dynamic, and I try to include new and relevant KD techniques as much as possible. You may occasionally see broken stuff, so please raise issues for that or open PRs. The README of my repository should contain enough documentation on how to run KD training in fairseq, but if you have any more queries, please feel free to reach out to me.
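For readers new to the technique: word-level KD of this kind typically interpolates the usual cross-entropy loss with a KL-divergence term against the teacher's temperature-softened output distribution. The sketch below is only illustrative and is not the exact criterion from the repository; the function name kd_loss is made up, and alpha_kd and temperature simply mirror the --alpha-kd and --temperature options that come up later in this thread.

```python
# Illustrative word-level KD loss (not the repository's exact criterion).
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, target, pad_idx,
            alpha_kd=5.0, temperature=2.5):
    """Combine cross-entropy on the gold labels with KL divergence
    against the teacher's softened distribution.

    student_logits / teacher_logits: (batch, tgt_len, vocab)
    target: (batch, tgt_len) token ids
    """
    # Standard cross-entropy against the reference translation.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        target.view(-1),
        ignore_index=pad_idx,
    )

    # Per-token KL divergence between softened student and teacher distributions.
    t = temperature
    student_lprobs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    kl = F.kl_div(student_lprobs, teacher_probs, reduction="none").sum(-1)

    # Mask out padding positions before averaging.
    mask = target.ne(pad_idx)
    kl = (kl * mask).sum() / mask.sum()

    # Scale by T^2 (Hinton et al.) and interpolate with the CE term.
    return ce + alpha_kd * (t ** 2) * kl
```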
@VarunGumma Nice work! It might be even better if it supported attention-based distillation such as TinyBERT and MiniLM. I will try those. Thanks.
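For reference, attention-based distillation in the spirit of TinyBERT adds an extra loss term that pushes the student's attention maps toward the teacher's. A rough, hypothetical sketch follows; it is not part of the repository and assumes both models expose per-layer attention tensors with matching head counts and sequence lengths.

```python
# Hypothetical TinyBERT-style attention transfer term.
import torch
import torch.nn.functional as F


def attention_distillation_loss(student_attns, teacher_attns, layer_map):
    """MSE between selected teacher attention maps and student attention maps.

    student_attns / teacher_attns: lists of tensors shaped
    (batch, num_heads, tgt_len, src_len); layer_map pairs each student
    layer index with the teacher layer index it should imitate.
    Assumes matching head counts and sequence lengths.
    """
    loss = 0.0
    for s_idx, t_idx in layer_map:
        s_attn = student_attns[s_idx]
        t_attn = teacher_attns[t_idx].detach()  # no gradient into the teacher
        loss = loss + F.mse_loss(s_attn, t_attn)
    return loss / len(layer_map)
```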
@VarunGumma Hello, I tried to run your code after "pip install --editable ./", but it fails with "fairseq-train: error: unrecognized arguments: --distillation-strategy batch_level --distillation-rate 0.5 --temperature 2.5 --temperature-schedule none --alpha-kd 5".
Please use the latest version of my code; you can find an example of knowledge_distillation_translation in the examples folder. As this work is in progress, I make multiple bug fixes and changes, so please stay up to date with the code.
@VarunGumma Could you please give a little more detail about the dataset you used and the option "--user-dir $custom_model_dir"?
@HeegonJin I use a custom model architecture which I defined in a file in that directory ($custom_model_dir). If you are using models (parent and student) that are defined in the fairseq library, you won't need that parameter.
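For context on --user-dir: fairseq imports the module at that path on startup, so a custom architecture only has to be registered there to become visible to fairseq-train. Below is a bare-bones sketch of such a file; the directory layout and the transformer_4x name and dimensions are placeholders, not necessarily what the repository uses.

```python
# $custom_model_dir/__init__.py  (path and names are placeholders)
# fairseq imports this module when --user-dir points at the directory,
# making the architectures registered here available to fairseq-train.
from fairseq.models import register_model_architecture
from fairseq.models.transformer import base_architecture


@register_model_architecture("transformer", "transformer_4x")
def transformer_4x(args):
    # A wider variant of the stock Transformer, used here as the teacher;
    # the exact dimensions are illustrative.
    args.encoder_embed_dim = getattr(args, "encoder_embed_dim", 1024)
    args.encoder_ffn_embed_dim = getattr(args, "encoder_ffn_embed_dim", 8192)
    args.decoder_embed_dim = getattr(args, "decoder_embed_dim", 1024)
    args.decoder_ffn_embed_dim = getattr(args, "decoder_ffn_embed_dim", 8192)
    base_architecture(args)
```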
> @HeegonJin I have a basic implementation of KD in my repo here. It is based on the implementation of https://github.com/LeslieOverfitting/selective_distillation. They use a much older version of fairseq, and I have tried to integrate their changes into the newer version of fairseq. Here is a sample bash script to distill a transformer-4x model to a smaller transformer-base model.

Thanks for the detail! By the way, your "sample bash script" links to a random paper.
I have fixed the issue.
@VarunGumma Hello, if I just want to do the simplest form of knowledge distillation, can I do this: train the standard Transformer model on the original data, then translate the training set with it to get distilled data, and finally use the distilled data to train the student model. Would that count as simple knowledge distillation?
Yes, what you described is hard-label distillation, @kkeleve.
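For completeness, that hard-label (sequence-level) recipe can be scripted with fairseq's hub interface: train the teacher as usual, re-decode the training source side with it, and train the student on the decoded targets. A rough sketch; all paths, the beam size, and the assumption that the teacher is a Transformer checkpoint are placeholders.

```python
# Sequence-level ("hard label") distillation sketch; paths are placeholders.
from fairseq.models.transformer import TransformerModel

# Load the trained teacher checkpoint with its binarized data setup.
# (Pass bpe=... / tokenizer=... kwargs here if the data was preprocessed with them.)
teacher = TransformerModel.from_pretrained(
    "checkpoints/teacher",
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="data-bin/parallel",
)
teacher.eval()
teacher.cuda()

# Re-translate the training source side with the teacher; these outputs
# replace the gold targets when training the student.
with open("train.src") as src, open("train.distilled.tgt", "w") as out:
    for line in src:
        out.write(teacher.translate(line.strip(), beam=5) + "\n")

# Afterwards: binarize (train.src, train.distilled.tgt) with fairseq-preprocess
# and run fairseq-train on it with the smaller student architecture.
```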