PKD-for-BERT-Model-Compression
Some questions about layer number (model size)
Hi,
Thank you for your interesting work! I have recently started learning about BERT and distillation, and I have some general questions on this topic.
- I want to compare the performance of BERT at different model sizes (i.e., different numbers of transformer blocks). Is distillation necessary for this? If I just train a 6-layer BERT without distillation, will the performance be noticeably worse?
- Do you have to redo pre-training every time you change the number of layers in BERT? Or is it possible to simply remove some layers from an existing pre-trained model and then fine-tune on downstream tasks (see the sketch after this list)?
- Why does BERT have 12 blocks rather than 11 or 13, etc.? I couldn't find any explanation for this choice.
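To make my second question concrete, here is a rough sketch of what I had in mind, written with the Hugging Face transformers library rather than the code in this repository; the model name and the choice of which layers to keep are just illustrative. Is this a reasonable way to initialize a smaller model before fine-tuning or distillation?

```python
# Hypothetical sketch (not from this repo): build a 6-layer student by copying
# a subset of encoder layers from a pre-trained 12-layer BERT.
from transformers import BertConfig, BertModel

# Pre-trained 12-layer teacher.
teacher = BertModel.from_pretrained("bert-base-uncased")

# Same architecture, but only 6 transformer blocks; weights start random.
student_config = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=6)
student = BertModel(student_config)

# Copy the embeddings and pooler directly, since their shapes are unchanged.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
student.pooler.load_state_dict(teacher.pooler.state_dict())

# Copy a subset of the teacher's transformer blocks into the student.
# Keeping every other layer is just one possible choice.
keep = [0, 2, 4, 6, 8, 10]
for student_idx, teacher_idx in enumerate(keep):
    student.encoder.layer[student_idx].load_state_dict(
        teacher.encoder.layer[teacher_idx].state_dict()
    )

# The student could then be fine-tuned on a task, with or without an
# additional distillation loss against the teacher.
```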
Thanks, ZLK