DeBERTa
Pre-training times: v2 vs. v3
Hi,
it would be very interesting to also see a comparison of pre-training times for DeBERTa v2 versus the recently released v3, which uses the replaced token detection (RTD) objective.
The v2 paper mentioned pre-training times:
But what about the v3 base, large, and multilingual models? :thinking:
I was trying to pre-train DeBERTa v2 with the RTD objective (but without gradient-disentangled embedding sharing), and I noticed that it runs much slower than ELECTRA (which is BERT-based).
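For context, this is roughly the ELECTRA-style RTD step I am computing; `generator` and `discriminator` are placeholders for an MLM-head model and a token-level binary classifier, and the shapes/names are illustrative only, not the official DeBERTa-v3 recipe:

```python
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, attention_mask,
             mlm_mask, mask_token_id, rtd_weight=50.0):
    # 1) Generator sees the masked sequence and gets a standard MLM loss.
    masked_input = input_ids.masked_fill(mlm_mask, mask_token_id)
    gen_logits = generator(masked_input, attention_mask=attention_mask).logits
    mlm_labels = input_ids.masked_fill(~mlm_mask, -100)  # ignore unmasked positions
    mlm_loss = F.cross_entropy(gen_logits.transpose(1, 2), mlm_labels, ignore_index=-100)

    # 2) Sample replacement tokens from the generator (no gradient through sampling).
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mlm_mask, sampled, input_ids)

    # 3) Discriminator predicts, per token, whether it was replaced.
    rtd_labels = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted, attention_mask=attention_mask).logits.squeeze(-1)
    per_token = F.binary_cross_entropy_with_logits(disc_logits, rtd_labels, reduction="none")
    rtd_loss = (per_token * attention_mask).sum() / attention_mask.sum()

    return mlm_loss + rtd_weight * rtd_loss
```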
I did some quick benchmarking and found that DeBERTa is roughly twice as slow as BERT for inference.
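For reference, this is a minimal sketch of the kind of timing comparison I ran; the checkpoint names, sequence length, and run counts are just placeholders for whatever you want to compare, and it assumes a CUDA device:

```python
import time

import torch
from transformers import AutoModel, AutoTokenizer

def benchmark(model_name, text, n_runs=50, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    inputs = tokenizer(text, return_tensors="pt").to(device)

    with torch.no_grad():
        # Warm-up runs so lazy initialization and kernel launches don't skew the timing.
        for _ in range(5):
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

text = "This is a short benchmark sentence. " * 16
for name in ["bert-base-uncased", "microsoft/deberta-v3-base"]:
    print(f"{name}: {benchmark(name, text) * 1000:.1f} ms / forward pass")
```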