Pretrained-Language-Model
Embedding layer distillation not implemented?
Embedding layer distillation is discussed in the paper and included in the results, but I can't see it referenced in the README or anywhere in the code. Was it implemented in a later (unreleased) version?
Embedding layer distillation uses the same MSE loss function as the hidden-state layers, so its loss is computed the same way as the hidden-state loss. The code is at lines 958~960 of task_distill.py (new_student_reps[0] and new_teacher_reps[0] are the embedding layer outputs).
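For reference, here is a minimal sketch of that computation, not the repo's exact code: the function name `hidden_rep_loss` and the toy tensor shapes are illustrative, but the key point matches the reply above, index 0 of each representation list is the embedding-layer output, so one MSE loop covers both embedding and hidden-state distillation.

```python
import torch

loss_mse = torch.nn.MSELoss()

def hidden_rep_loss(student_reps, teacher_reps):
    # Index 0 of each list is the embedding-layer output, so this loop
    # applies the same MSE term to the embeddings as to the layer outputs.
    loss = 0.0
    for s_rep, t_rep in zip(student_reps, teacher_reps):
        loss = loss + loss_mse(s_rep, t_rep)
    return loss

# Toy tensors standing in for (batch, seq_len, hidden) layer outputs;
# entry 0 plays the role of new_student_reps[0] / new_teacher_reps[0].
student = [torch.randn(2, 4, 8) for _ in range(5)]
teacher = [torch.randn(2, 4, 8) for _ in range(5)]
print(hidden_rep_loss(student, teacher))
```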
@chuanhuayang I think he's talking about "general distillation". The general distillation code does not include embedding-layer distillation, but the paper includes it.
The sequence_output of TinyBertForSequenceClassification includes the embedding layer's output.
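To illustrate why index 0 is the embedding output, here is a hypothetical sketch (the class `TinyEncoder` and its dimensions are made up, not TinyBERT's actual source): an encoder that returns every hidden state typically seeds the list with the embedding output before any Transformer layer runs.

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    # Stand-in for a BERT-style encoder that collects all hidden states.
    def __init__(self, vocab=100, hidden=8, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(layers))

    def forward(self, input_ids):
        hidden_states = [self.embed(input_ids)]  # embedding output goes in first
        for layer in self.layers:
            hidden_states.append(torch.tanh(layer(hidden_states[-1])))
        return hidden_states  # analogous to sequence_output in this thread

encoder = TinyEncoder()
sequence_output = encoder(torch.randint(0, 100, (1, 4)))
print(len(sequence_output))  # layers + 1: the embedding output plus each layer
```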