DeBERTa
Pre-training times: v2 vs. v3
Hi,
it would be very interesting to also see a comparison of pre-training times for DeBERTa v2 versus the recently released v3, which uses the replaced token detection (RTD) objective.
The v2 paper mentioned pre-training times:
But what about the v3 base, large, and multilingual models? :thinking:
I was trying to pre-train DeBERTa v2 with the RTD objective (but without gradient-disentangled embedding sharing), and I noticed that it runs much slower than ELECTRA (which is BERT-based).
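For context, this is roughly the ELECTRA-style RTD step I am computing; `generator` and `discriminator` are placeholders for an MLM-head model and a token-level binary classifier, and the shapes/names are illustrative only, not the official DeBERTa-v3 recipe:

```python
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, attention_mask,
             mlm_mask, mask_token_id, rtd_weight=50.0):
    # 1) Generator sees the masked sequence and gets a standard MLM loss.
    masked_input = input_ids.masked_fill(mlm_mask, mask_token_id)
    gen_logits = generator(masked_input, attention_mask=attention_mask).logits
    mlm_labels = input_ids.masked_fill(~mlm_mask, -100)  # ignore unmasked positions
    mlm_loss = F.cross_entropy(gen_logits.transpose(1, 2), mlm_labels, ignore_index=-100)

    # 2) Sample replacement tokens from the generator (no gradient through sampling).
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mlm_mask, sampled, input_ids)

    # 3) Discriminator predicts, per token, whether it was replaced.
    rtd_labels = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted, attention_mask=attention_mask).logits.squeeze(-1)
    per_token = F.binary_cross_entropy_with_logits(disc_logits, rtd_labels, reduction="none")
    rtd_loss = (per_token * attention_mask).sum() / attention_mask.sum()

    return mlm_loss + rtd_weight * rtd_loss
```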
I did some quick benchmarking and found that DeBERTa is roughly twice as slow as BERT for inference.
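For reference, this is a minimal sketch of the kind of timing comparison I ran; the checkpoint names, sequence length, and run counts are just placeholders for whatever you want to compare, and it assumes a CUDA device:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

def benchmark(model_name, text, n_runs=50, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    inputs = tokenizer(text, return_tensors="pt").to(device)

    with torch.no_grad():
        # Warm-up runs so lazy initialization and kernel launches don't skew the timing.
        for _ in range(5):
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

text = "This is a short benchmark sentence. " * 16
for name in ["bert-base-uncased", "microsoft/deberta-v3-base"]:
    print(f"{name}: {benchmark(name, text) * 1000:.1f} ms / forward pass")
```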