
Distributed training for the RCL task

Open JadinTredupLP opened this issue 3 years ago • 2 comments

Hello, I am trying to pretrain ToD-BERT on my own dataset, but because of its size I need to distribute training to speed up computation. Distributed training seems to be built into the MLM task, but distributing the RCL task throws an error. We have written some code to distribute the RCL task ourselves, but our training results show little to no improvement in the RS loss compared to the single-GPU case. Is there a specific reason you decided not to distribute the RCL task over multiple GPUs, or a problem you encountered, or is there most likely just a bug in our code?
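
For context, our setup follows roughly the standard PyTorch DistributedDataParallel pattern sketched below (a minimal illustration only; names such as `setup_ddp`, `model`, and `train_dataset` are placeholders and not code from the ToD-BERT repo):

```python
# Minimal DDP setup sketch, assuming the script is launched with torchrun.
# All names here are illustrative, not taken from the ToD-BERT codebase.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup_ddp(model, train_dataset, batch_size):
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun / launch
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = model.cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Each rank sees a disjoint shard of the data. Note this also shrinks
    # the per-GPU batch, which matters if the loss uses in-batch negatives.
    sampler = DistributedSampler(train_dataset)
    loader = DataLoader(train_dataset, batch_size=batch_size, sampler=sampler)
    return model, loader
```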

JadinTredupLP · Jan 25 '22 19:01

Hi,

Can you share the error you get when running the RCL training? We did not focus much on parallel training at the time and relied on the Hugging Face implementation for that.

jasonwu0731 · Jan 25 '22 22:01

I am not actually getting an error; the RS loss just stops decreasing once training is distributed. On a single GPU it converges fine (for a small amount of data), but with the same amount of data the distributed run did not converge at all.
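
One guess on our side (just a hypothesis, not something we have confirmed) is that since the response contrastive objective uses the other responses in the batch as negatives, splitting the batch across GPUs shrinks the pool of negatives each rank sees. A common fix for contrastive losses under DDP is to all_gather the embeddings before building the similarity matrix; here is a minimal sketch of that idea, with illustrative names (`GatherLayer`, `rcl_loss`, `ctx_emb`, `rsp_emb`), none of which come from the ToD-BERT code:

```python
# Sketch: gather response/context embeddings across ranks so the contrastive
# loss still sees the full global batch of in-batch negatives.
import torch
import torch.distributed as dist
import torch.nn.functional as F


class GatherLayer(torch.autograd.Function):
    """all_gather with a backward pass, so gradients reach the local inputs."""

    @staticmethod
    def forward(ctx, x):
        out = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(out, x)
        return tuple(out)

    @staticmethod
    def backward(ctx, *grads):
        all_grads = torch.stack(grads)
        dist.all_reduce(all_grads)          # sum gradient contributions from all ranks
        return all_grads[dist.get_rank()]   # keep the slice for this rank's inputs


def rcl_loss(ctx_emb, rsp_emb):
    """In-batch contrastive loss over the global batch.

    ctx_emb, rsp_emb: (local_batch, hidden) context / response embeddings.
    """
    if dist.is_initialized() and dist.get_world_size() > 1:
        ctx_emb = torch.cat(GatherLayer.apply(ctx_emb), dim=0)
        rsp_emb = torch.cat(GatherLayer.apply(rsp_emb), dim=0)
    logits = ctx_emb @ rsp_emb.t()                # (global_batch, global_batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)        # diagonal entries are the positives
```

With something like this, each rank still backpropagates only through its local embeddings, but the similarity matrix covers the full global batch, so the effective number of negatives matches the single-GPU case.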

JadinTredupLP · Jan 26 '22 17:01