bitfusion-with-kubernetes-integration
Can we use Bitfusion to run Distributed Data Parallel PyTorch code?
I recently got a VM with 2 A100 GPUs, and I want to use it to run data-parallel training with PyTorch. However, I have run into several problems with the environment. The same code runs successfully on my lab's server without Bitfusion. I would like to know whether Bitfusion does not support torch.nn.parallel.DistributedDataParallel (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) or NCCL (https://developer.nvidia.com/nccl).
I am looking forward to your reply.
Have you solved this issue? I ran into the same problems, but I could not find any resources or solutions.
No! Bitfusion does not support DDP, because some NCCL versions are not supported by Bitfusion, and we cannot change the NCCL version in our environment.
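For anyone else hitting this, it may help to first check which NCCL version your PyTorch build reports and which torch.distributed backends it exposes. Below is a minimal sketch for that. The helper names describe_distributed_support and pick_backend are hypothetical and made up for this post, and the script only inspects the local PyTorch install. It says nothing about which NCCL versions Bitfusion accepts, and whether the Gloo backend works under Bitfusion is not confirmed anywhere in this thread.

```python
import importlib.util


def describe_distributed_support():
    """Report which torch.distributed backends the local PyTorch build exposes.

    Hypothetical helper: it only inspects PyTorch and knows nothing about Bitfusion.
    """
    info = {"torch": False, "cuda": False, "nccl": False,
            "gloo": False, "nccl_version": None}
    if importlib.util.find_spec("torch") is None:
        return info  # PyTorch is not installed in this environment

    import torch
    import torch.distributed as dist

    info["torch"] = True
    info["cuda"] = torch.cuda.is_available()
    if dist.is_available():
        info["nccl"] = dist.is_nccl_available()
        info["gloo"] = dist.is_gloo_available()
    if info["nccl"]:
        from torch.cuda import nccl
        # NCCL version that this PyTorch build was compiled against
        info["nccl_version"] = nccl.version()
    return info


def pick_backend(info):
    """Prefer NCCL when CUDA and NCCL are both present, else fall back to Gloo."""
    if info["cuda"] and info["nccl"]:
        return "nccl"
    if info["gloo"]:
        return "gloo"
    return None


if __name__ == "__main__":
    report = describe_distributed_support()
    print(report)
    print("candidate backend:", pick_backend(report))
```

The value returned by pick_backend is what you would pass as the backend argument to torch.distributed.init_process_group. Comparing the reported NCCL version between the Bitfusion VM and a machine where DDP works is one way to see whether the versions actually differ.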