Pretrained-Language-Model
Embedding layer distillation not implemented?
Embedding layer distillation is discussed in the paper and included in the results, but I can't see it referenced in the README or anywhere in the code. Was it implemented in a later (unreleased) version?
Embedding layer distillation uses the same MSE loss function as the hidden-state layers, so its loss is computed the same way as the hidden-state loss. The code is at lines 958~960 of task_distill.py (new_student_reps[0] and new_teacher_reps[0] are the embedding layer outputs).
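For reference, here is a minimal sketch of that computation, not the repo's exact code: the function name `hidden_rep_loss` and the toy tensor shapes are illustrative, but the key point matches the reply above, index 0 of each representation list is the embedding-layer output, so one MSE loop covers both embedding and hidden-state distillation.

```python
import torch

loss_mse = torch.nn.MSELoss()

def hidden_rep_loss(student_reps, teacher_reps):
    # Index 0 of each list is the embedding-layer output, so this loop
    # applies the same MSE term to the embeddings as to the layer outputs.
    loss = 0.0
    for s_rep, t_rep in zip(student_reps, teacher_reps):
        loss = loss + loss_mse(s_rep, t_rep)
    return loss

# Toy tensors standing in for (batch, seq_len, hidden) layer outputs;
# entry 0 plays the role of new_student_reps[0] / new_teacher_reps[0].
student = [torch.randn(2, 4, 8) for _ in range(5)]
teacher = [torch.randn(2, 4, 8) for _ in range(5)]
print(hidden_rep_loss(student, teacher))
```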
@chuanhuayang I think he's talking about "general distillation". The general distillation code does not include embedding-layer distillation, but the paper includes it.
The sequence_output of TinyBertForSequenceClassification includes the embedding layer's output.
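To illustrate why index 0 is the embedding output, here is a hypothetical sketch (the class `TinyEncoder` and its dimensions are made up, not TinyBERT's actual source): an encoder that returns every hidden state typically seeds the list with the embedding output before any Transformer layer runs.

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    # Stand-in for a BERT-style encoder that collects all hidden states.
    def __init__(self, vocab=100, hidden=8, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.layers = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(layers))

    def forward(self, input_ids):
        hidden_states = [self.embed(input_ids)]  # embedding output goes in first
        for layer in self.layers:
            hidden_states.append(torch.tanh(layer(hidden_states[-1])))
        return hidden_states  # analogous to sequence_output in this thread

encoder = TinyEncoder()
sequence_output = encoder(torch.randint(0, 100, (1, 4)))
print(len(sequence_output))  # layers + 1: the embedding output plus each layer
```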