Pretrained-Language-Model
Difference between paper and source code in training DynaBERT
I found a difference between the paper and the source code:
- In the paper: the student model (DynaBERT, width_mult, depth_mult) learns from the teacher assistant model (DynaBERTw, width_mult, depth_mult=1).
- In the code: the student model (DynaBERT, width_mult, depth_mult) learns from the teacher assistant model (DynaBERTw, width_mult=1, depth_mult=1).
Could anyone explain this difference? Thanks!
The released code uses the width-adaptive teacher assistant at its largest width and depth (DynaBERTw, width_mult=1, depth_mult=1) as the teacher model. You can also use (DynaBERTw, width_mult, depth_mult=1) by inserting

```python
teacher_model.apply(lambda m: setattr(m, 'width_mult', width_mult))
```

after line 124 of run_glue.py.
We tested both versions, and they have similar performance, so we released the simpler one, i.e., using the same width-adaptive teacher assistant at its largest width and depth (DynaBERTw, width_mult=1, depth_mult=1) as the teacher model.
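For context, here is a minimal sketch of where that call sits in a width-adaptive distillation step. The variable names (`width_mult_list`, `teacher_model`, `student_model`) and the HF-style `(logits, ...)` model output are illustrative assumptions, not the exact code in run_glue.py:

```python
# Sketch of one width-adaptive distillation step (assumed names and shapes).
import torch
import torch.nn.functional as F

def distill_step(student_model, teacher_model, batch, width_mult_list,
                 match_teacher_width=False):
    total_loss = 0.0
    for width_mult in width_mult_list:
        # Switch every width-adaptive submodule of the student to this width.
        student_model.apply(lambda m: setattr(m, 'width_mult', width_mult))

        if match_teacher_width:
            # Paper variant: teacher assistant runs at the matching width
            # (DynaBERTw, width_mult, depth_mult=1). This is the extra line
            # inserted after line 124 of run_glue.py.
            teacher_model.apply(lambda m: setattr(m, 'width_mult', width_mult))
        # Released-code variant: teacher stays at full width
        # (DynaBERTw, width_mult=1, depth_mult=1); nothing to do here.

        with torch.no_grad():
            teacher_logits = teacher_model(**batch)[0]
        student_logits = student_model(**batch)[0]

        # Distill the student's logits toward the teacher's at each width.
        total_loss = total_loss + F.mse_loss(student_logits, teacher_logits)
    return total_loss
```

In both variants the student iterates over the same widths; the only difference is whether the teacher assistant's width is switched to match, which is why the two versions end up performing similarly.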