Info on Deberta-v2-xlarge training infra
The paper discusses DeBERTa-base, DeBERTa-large, and the DeBERTa-1.5B model trained on V100 GPUs. How was DeBERTa-v2-xlarge trained? Were the settings for the xlarge model the same as those used for the large model in the paper? Since DeBERTa-v2-xlarge has roughly 900M parameters, was any tensor parallelism used during training?
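For context, the ~900M figure can be roughly reproduced from the published DeBERTa-v2-xlarge config (hidden size 1536, 24 layers, intermediate size 6144, vocabulary size 128100). A back-of-the-envelope sketch — ignoring biases, layer norms, and the disentangled-attention relative-position embeddings, so it slightly undercounts the ~884M reported on the model card:

```python
# Rough parameter-count estimate for DeBERTa-v2-xlarge.
# Config values are taken from the published model config; biases,
# layer norms, and relative-position parameters are omitted.
hidden = 1536
layers = 24
intermediate = 6144
vocab = 128100

embeddings = vocab * hidden                  # token embedding matrix
attention_per_layer = 4 * hidden * hidden    # Q, K, V, and output projections
ffn_per_layer = 2 * hidden * intermediate    # FFN up- and down-projections
transformer = layers * (attention_per_layer + ffn_per_layer)

total = embeddings + transformer
print(f"~{total / 1e6:.0f}M parameters")    # prints "~876M parameters"
```

This lands in the same ballpark as the quoted ~900M, which is why the question about model/tensor parallelism on 32 GB V100s is a reasonable one.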