Pretrained-Language-Model
Difference between paper and source code in training DynaBERT
I found a difference between the paper and the source code:
- In the paper: the student model (DynaBERT, width_mult, depth_mult) learns from the teacher assistant model (DynaBERTw, width_mult, depth_mult=1).
- In the code: the student model (DynaBERT, width_mult, depth_mult) learns from the teacher assistant model (DynaBERTw, width_mult=1, depth_mult=1).
Could anyone explain this difference? Thanks!
The released code uses the width-adaptive teacher assistant at its largest width and depth (DynaBERTw, width_mult=1, depth_mult=1) as the teacher model. You can also use (DynaBERTw, width_mult, depth_mult=1) by inserting

```python
teacher_model.apply(lambda m: setattr(m, 'width_mult', width_mult))
```

after line 124 of run_glue.py.
We tested both versions, and they have similar performance, so we released the simpler one, i.e., using the same width-adaptive teacher assistant at its largest width and depth (DynaBERTw, width_mult=1, depth_mult=1) as the teacher model.
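For context, here is a minimal sketch of where that call sits in a width-adaptive distillation step. The variable names (`width_mult_list`, `teacher_model`, `student_model`) and the HF-style `(logits, ...)` model output are illustrative assumptions, not the exact code in run_glue.py:

```python
# Sketch of one width-adaptive distillation step (assumed names and shapes).
import torch
import torch.nn.functional as F

def distill_step(student_model, teacher_model, batch, width_mult_list,
                 match_teacher_width=False):
    total_loss = 0.0
    for width_mult in width_mult_list:
        # Switch every width-adaptive submodule of the student to this width.
        student_model.apply(lambda m: setattr(m, 'width_mult', width_mult))

        if match_teacher_width:
            # Paper variant: teacher assistant runs at the matching width
            # (DynaBERTw, width_mult, depth_mult=1). This is the extra line
            # inserted after line 124 of run_glue.py.
            teacher_model.apply(lambda m: setattr(m, 'width_mult', width_mult))
        # Released-code variant: teacher stays at full width
        # (DynaBERTw, width_mult=1, depth_mult=1); nothing to do here.

        with torch.no_grad():
            teacher_logits = teacher_model(**batch)[0]
        student_logits = student_model(**batch)[0]

        # Distill the student's logits toward the teacher's at each width.
        total_loss = total_loss + F.mse_loss(student_logits, teacher_logits)
    return total_loss
```

In both variants the student iterates over the same widths; the only difference is whether the teacher assistant's width is switched to match, which is why the two versions end up performing similarly.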