bitfusion-with-kubernetes-integration

Can we use Bitfusion to run Distributed Data Parallel PyTorch code?

Open ljz756245026 opened this issue 3 years ago • 2 comments

Recently I got a VM with two A100 GPUs. I want to use this VM to run data-parallel training with PyTorch, but I have run into several problems with the environment. The same code runs successfully on my lab's server without Bitfusion. I would like to know whether Bitfusion does not support torch.nn.parallel.DistributedDataParallel (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) or NCCL (https://developer.nvidia.com/nccl).

I am looking forward to your reply.

ljz756245026, Sep 27 '21 09:09

Have you solved this issue? I ran into the same problem but could not find any resources or solutions.

YanJenHuang, Oct 25 '22 07:10

No. Bitfusion does not support DDP because some NCCL versions are not supported by Bitfusion, and we cannot change the NCCL version.
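For anyone hitting the same problem, a minimal sketch for reporting which NCCL version the installed PyTorch build uses may help when comparing against what Bitfusion supports. The helper name `describe_nccl_env` is made up for illustration, and the script only inspects the local PyTorch install; it does not query Bitfusion in any way.

```python
def describe_nccl_env():
    """Collect basic facts about the local PyTorch / NCCL setup as a dict.

    Hypothetical helper for illustration only; it does not talk to Bitfusion.
    """
    info = {
        "torch": None,
        "cuda_available": False,
        "gpu_count": 0,
        "nccl_available": False,
        "nccl_version": None,
    }
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed: return the defaults unchanged.
        return info

    info["torch"] = torch.__version__
    info["cuda_available"] = torch.cuda.is_available()
    info["gpu_count"] = torch.cuda.device_count()
    # is_nccl_available() is only consulted when the distributed package is built in.
    info["nccl_available"] = bool(dist.is_available() and dist.is_nccl_available())

    if info["nccl_available"]:
        try:
            from torch.cuda import nccl
            # Depending on the PyTorch release this is a tuple or an integer.
            info["nccl_version"] = nccl.version()
        except Exception:
            # Leave the version as None if it cannot be queried here.
            pass
    return info


if __name__ == "__main__":
    for key, value in describe_nccl_env().items():
        print(f"{key}: {value}")
```

Running this on both the lab server and the Bitfusion VM and comparing the `nccl_version` lines should show whether the two environments differ.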

ljz756245026, Oct 25 '22 10:10