PKD-for-BERT-Model-Compression
Some questions about layer number (model size)
Hi,
Thank you for your interesting work! I have recently started learning about BERT and distillation, and I have some general questions on this topic.
- I want to compare the performance of BERT at different model sizes (i.e., different numbers of transformer blocks). Is distillation necessary for this? If I just train a 6-layer BERT without distillation, will the performance be noticeably worse?
- Do you have to redo pre-training every time you change the number of layers in BERT? Or is it possible to simply remove some layers from an existing pre-trained model and then fine-tune on downstream tasks (see the sketch after this list)?
- Why does BERT have 12 blocks rather than 11 or 13, etc.? I couldn't find any explanation for this choice.
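To make my second question concrete, here is a rough sketch of what I had in mind, written with the Hugging Face transformers library rather than the code in this repository; the model name and the choice of which layers to keep are just illustrative. Is this a reasonable way to initialize a smaller model before fine-tuning or distillation?

```python
# Hypothetical sketch (not from this repo): build a 6-layer student by copying
# a subset of encoder layers from a pre-trained 12-layer BERT.
from transformers import BertConfig, BertModel

# Pre-trained 12-layer teacher.
teacher = BertModel.from_pretrained("bert-base-uncased")

# Same architecture, but only 6 transformer blocks; weights start random.
student_config = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=6)
student = BertModel(student_config)

# Copy the embeddings and pooler directly, since their shapes are unchanged.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
student.pooler.load_state_dict(teacher.pooler.state_dict())

# Copy a subset of the teacher's transformer blocks into the student.
# Keeping every other layer is just one possible choice.
keep = [0, 2, 4, 6, 8, 10]
for student_idx, teacher_idx in enumerate(keep):
    student.encoder.layer[student_idx].load_state_dict(
        teacher.encoder.layer[teacher_idx].state_dict()
    )

# The student could then be fine-tuned on a task, with or without an
# additional distillation loss against the teacher.
```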
Thanks, ZLK