
Training on multiple GPUs

Open · mohummedalee opened this issue 8 months ago · 0 comments

I'm re-using the Trainer implemented in examples.classification.src.trainer. It largely looks like a port of the original Trainer source code, but I noticed that it has an additional check that aborts training when multiple GPUs are available or a distributed run is configured. Specifically:

if self.args.local_rank != -1 or self.args.n_gpu > 1:
    raise ValueError("Multi-gpu and distributed training is currently not supported.")

What could go wrong if I comment this check out and let multi-GPU training proceed by wrapping the model in torch.nn.DataParallel(model)? Appreciate the well-written code; thanks in advance for the help.
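For concreteness, the change I have in mind looks roughly like this (the nn.Linear model is just an illustrative stand-in for the actual classifier the Trainer receives):

import torch
import torch.nn as nn

# Illustrative stand-in for the classifier the Trainer would actually receive.
model = nn.Linear(768, 2)

# nn.DataParallel replicates the module on every visible GPU, splits each
# input batch along dim 0, and sums the gradients back onto the primary
# device after the backward pass.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

(I realize DataParallel is single-process multi-GPU rather than true distributed training, so the self.args.local_rank branch of the check wouldn't apply; my question is specifically about the self.args.n_gpu > 1 branch.)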

mohummedalee · Jun 11 '24 22:06