Add slow interconnect warning
Several users have opened issues asking whether it is a bug that multi-GPU training can be slower than single-GPU training. This is not a LitGPT bug; it happens on machines with slow GPU interconnects.
This PR adds a warning when a slow GPU interconnect is detected and suggests using a different machine for multi-GPU training.
CC @apaz-cli
Fixes #1369 Fixes #607 Fixes #1581
Nice, I just thought about it today.
The only question: maybe it's possible to launch something in a try/except block and just check that there is a proper connection. https://discuss.pytorch.org/t/simple-code-example-with-nvlink-support/125304 Right now I don't have access to a multi-GPU machine, so I cannot validate any of the options.
How would you check it? Via the nccl-tests tool?
No, nccl-tests are to measure performance. We don't need it.
I thought more like

```python
import torch.distributed as dist

dist.init_process_group("nccl")
```

and then maybe do something with it 🤷.
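A minimal sketch of that probe idea: initialize a process group in a try/except and run a tiny collective to confirm the connection works. Everything here is illustrative, not part of LitGPT: the helper name, the single-process `gloo` group (so the sketch runs without a launcher), and the port are all assumptions; a real check would use the NCCL backend across all ranks.

```python
def probe_process_group(backend="gloo"):
    """Return True if a process group can be set up and used,
    False on failure, or None when torch is not installed."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return None  # torch not available, nothing to probe
    try:
        # Single-process group so the sketch runs standalone; a real
        # probe would use backend="nccl" with the actual world size.
        dist.init_process_group(
            backend,
            init_method="tcp://127.0.0.1:29500",
            rank=0,
            world_size=1,
        )
        t = torch.ones(1)
        dist.all_reduce(t)  # trivial with world_size=1, but exercises the group
        return True
    except Exception:
        return False
    finally:
        if dist.is_initialized():
            dist.destroy_process_group()
```

This only confirms that a group can be created and used at all; it says nothing about interconnect speed, which is what the warning in this PR is about.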
Anyway, I think your approach fits the bill. Later, when I get access to a multi-GPU machine, I'll try to check if NVLink is available through torch.
> No, nccl-tests are to measure performance. We don't need it.
I agree. That would be overkill, which is why I implemented the current approach.
> Anyway, I think your approach fits the bill. Later, when I get access to a multi-GPU machine, I'll try to check if NVLink is available through torch.
I don't think you can get this info from within PyTorch, but please correct me if I am wrong. In that case I'd be happy to update it.
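Since PyTorch doesn't seem to expose NVLink topology directly, one workaround is to shell out to `nvidia-smi topo -m` and look for `NV#` cells in the topology matrix. A rough sketch (the helper name is hypothetical, and the `NV#` parsing assumes the current `nvidia-smi` output format):

```python
import subprocess


def gpus_nvlinked():
    """Return True/False depending on whether NVLink links appear in the
    GPU topology matrix, or None when nvidia-smi is unavailable."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "topo", "-m"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.SubprocessError):
        return None  # no NVIDIA driver/tool on this machine
    # Cells like "NV1", "NV2", ... mark NVLink connections between GPUs,
    # while "SYS", "PHB", "PIX", etc. indicate PCIe/system paths.
    return any(
        cell.startswith("NV") and cell[2:].isdigit()
        for line in out.splitlines()
        for cell in line.split()
    )
```

On machines without `nvidia-smi` this returns None, so it can be used as an opportunistic check without breaking CPU-only setups.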