
Add slow interconnect warning

Open rasbt opened this issue 1 year ago • 4 comments

Many users have asked or opened issues about whether there is a bug because multi-GPU training can be slower than single-GPU training. This is not a LitGPT bug; it happens when machines with slow GPU interconnects are used.

This PR adds a warning when a slow GPU interconnect is detected and suggests using a different machine for multi-GPU training.
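For illustration, a check of this kind could look roughly like the sketch below. This is not the exact code added in this PR; it assumes the optional `pynvml` package is installed, and the `warn_if_slow_interconnect` helper name is hypothetical:

```python
# Hedged sketch of a slow-interconnect warning, assuming pynvml is installed.
# The helper name is illustrative, not LitGPT's actual API.
import warnings

import pynvml


def warn_if_slow_interconnect() -> None:
    """Warn when multiple GPUs are present but none of them report an active NVLink."""
    pynvml.nvmlInit()
    try:
        num_gpus = pynvml.nvmlDeviceGetCount()
        if num_gpus < 2:
            return  # single-GPU training is unaffected
        for i in range(num_gpus):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
                try:
                    if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                        return  # at least one active NVLink found
                except pynvml.NVMLError:
                    break  # this GPU exposes no further NVLink links
        warnings.warn(
            "No NVLink detected between GPUs. Multi-GPU training may be slower than "
            "single-GPU training on this machine; consider a machine with a faster interconnect."
        )
    finally:
        pynvml.nvmlShutdown()
```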

CC @apaz-cli

Fixes #1369 Fixes #607 Fixes #1581

rasbt avatar Jul 12 '24 18:07 rasbt

Nice, I just thought about it today.

The only question: maybe it's possible to launch something in a try/except block and simply check that there is a proper connection. https://discuss.pytorch.org/t/simple-code-example-with-nvlink-support/125304 Right now I don't have access to a multi-GPU machine, so I cannot validate any of the options.

Andrei-Aksionov avatar Jul 12 '24 18:07 Andrei-Aksionov

How would you check it? Via the nccl-tests tool?

rasbt avatar Jul 12 '24 20:07 rasbt

No, nccl-tests is for measuring performance; we don't need it here.

I thought more like

```python
import torch.distributed as dist

dist.init_process_group("nccl")
```

and then maybe do something with it 🤷.
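Something along those lines could look like this sketch (assuming a `torchrun` launch with one process per GPU, which sets `LOCAL_RANK` and the rendezvous env vars; I haven't been able to validate it on a multi-GPU machine):

```python
# Hypothetical probe: run a tiny all-reduce under try/except to verify that the
# NCCL backend can form a process group and communicate at all.
# Launch with: torchrun --nproc_per_node=<num_gpus> probe.py
import os

import torch
import torch.distributed as dist

try:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # A 1-element all-reduce exercises the interconnect without measuring its speed.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: NCCL communication works, result={t.item()}")
except Exception as exc:  # we only want a yes/no answer here
    print(f"NCCL check failed: {exc}")
finally:
    if dist.is_initialized():
        dist.destroy_process_group()
```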

Anyway, I think your approach fits the bill. Later, when I get access to a multi-GPU machine, I'll try to check whether NVLink is available through torch.

Andrei-Aksionov avatar Jul 13 '24 11:07 Andrei-Aksionov

> No, nccl-tests is for measuring performance; we don't need it here.

I agree. That would be overkill, which is why I implemented the current approach.

> Anyway, I think your approach fits the bill. Later, when I get access to a multi-GPU machine, I'll try to check whether NVLink is available through torch.

I don't think you can get this info from within PyTorch, but please correct me if I am wrong. In that case, I'd be happy to update it.
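Outside of PyTorch, one rough option is to parse the connectivity matrix printed by `nvidia-smi topo -m`. A hedged sketch; the `has_nvlink` helper below is illustrative and not what this PR ships:

```python
# Sketch: detect NVLink without going through PyTorch by scanning the
# `nvidia-smi topo -m` output for any "NV#" (NVLink) connectivity entries.
import shutil
import subprocess


def has_nvlink() -> bool:
    if shutil.which("nvidia-smi") is None:
        return False
    out = subprocess.run(
        ["nvidia-smi", "topo", "-m"], capture_output=True, text=True, check=False
    ).stdout
    # Entries such as "NV1", "NV2", ... denote NVLink connections between GPU pairs.
    return any(token.startswith("NV") and token[2:].isdigit() for token in out.split())


if __name__ == "__main__":
    print("NVLink detected" if has_nvlink() else "No NVLink detected")
```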

rasbt avatar Jul 13 '24 11:07 rasbt