
Training does not work

wileewang opened this issue 2 years ago • 7 comments

Thanks for your great work! I am new to MPI and I ran into some NCCL errors when I used your command to launch training. My environment is:

Ubuntu 20.04, 2 × RTX 3090, Python 3.7 + torch 1.10 + cu113

My command is:

```
MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 4"
mpiexec -n 2 python scripts/image_train.py --data_dir /home/CelebA-HQ-img/ $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
```

And I got:

```
Traceback (most recent call last):
  File "scripts/image_train.py", line 84, in <module>
    main()
  File "scripts/image_train.py", line 56, in main
    lr_anneal_steps=args.lr_anneal_steps,
  File "/projects/guided-diffusion/guided_diffusion/train_util.py", line 68, in __init__
    self._load_and_sync_parameters()
  File "/projects/guided-diffusion/guided_diffusion/train_util.py", line 123, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "/projects/guided-diffusion/guided_diffusion/dist_util.py", line 101, in sync_params
    dist.broadcast(p, 0)
  File "/miniconda3/envs/tensorflow/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1187, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 21.0.3
ncclUnhandledCudaError: Call to CUDA function failed.
```

However, I tried tweaking the code (mainly dist_util.py) and using torch.distributed.launch to start the training, and that seems to work.

wileewang avatar Jul 07 '22 09:07 wileewang

Sorry, I found the reason. I'm running this command on a single machine with multiple GPUs, and some of the GPUs are occupied by other users, so I need to specify free GPUs for this training. I modified the code slightly and that solved it.
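For reference, one way to do this is to pin each MPI rank to a hand-picked free GPU before any CUDA context is created. This is a minimal sketch, not the actual patch; `FREE_GPUS` and `pin_free_gpu_for_rank` are illustrative names:

```python
# Sketch of a small tweak to dist_util.py: make each MPI rank see exactly one
# hand-picked free GPU, so NCCL never touches the GPUs other users occupy.
import os

from mpi4py import MPI

FREE_GPUS = [0, 1]  # hypothetical: list the IDs of the idle GPUs on your machine


def pin_free_gpu_for_rank():
    rank = MPI.COMM_WORLD.Get_rank()
    # Must run before torch initializes CUDA (i.e. before init_process_group
    # or any torch.cuda call), otherwise the setting is ignored.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(FREE_GPUS[rank % len(FREE_GPUS)])
```

Calling something like this at the very top of the distributed setup is one way to "specify free GPUs" per process.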

wileewang avatar Jul 07 '22 14:07 wileewang

> Sorry, I found the reason. I'm running this command on a single machine with multiple GPUs, and some of the GPUs are occupied by other users, so I need to specify free GPUs for this training. I modified the code slightly and that solved it.

I have the same problem; can you give me some advice? I added os.environ["CUDA_VISIBLE_DEVICES"] = "0,1" to the code, but it doesn't work.

baymin0220 avatar Aug 24 '22 13:08 baymin0220

When launching with MPI, you need to make a separate visible device for each process. My suggestion is to switch to launching with torch.distributed.
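A minimal sketch of what that can look like, assuming a torchrun / torch.distributed.launch --use_env style launcher that exports RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT (`setup_dist_from_launcher` is an illustrative name, not the repo's actual dist_util API):

```python
# Env://-based process-group setup driven by the launcher's environment variables.
import os

import torch
import torch.distributed as dist


def setup_dist_from_launcher():
    # The launcher spawns one process per GPU and tells each its local rank.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # one GPU per process, chosen by the launcher
    dist.init_process_group(backend="nccl", init_method="env://")
    return torch.device(f"cuda:{local_rank}")
```

Launched with something like `torchrun --nproc_per_node=2 scripts/image_train.py ...` (or `python -m torch.distributed.launch --use_env --nproc_per_node=2 ...` on torch 1.10), each process then picks its device from LOCAL_RANK instead of relying on MPI.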

wileewang avatar Aug 25 '22 03:08 wileewang

@wileewang Hello, can you tell me how to modify the code to run this command on a single machine with multiple GPUs?

pokameng avatar Oct 13 '22 11:10 pokameng

@pokameng Hello, have you solved this problem?

hxy-123-coder avatar Jun 13 '23 07:06 hxy-123-coder

export NCCL_P2P_DISABLE=1 works
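For reference, the same flag can also be set from Python before the process group is initialized; a minimal sketch (placing it at the top of scripts/image_train.py is just one option):

```python
# Equivalent to `export NCCL_P2P_DISABLE=1`: must be set before the first NCCL
# collective runs, e.g. before dist.init_process_group().
import os

os.environ["NCCL_P2P_DISABLE"] = "1"  # tell NCCL not to use GPU peer-to-peer transfers
```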

jordanchenzb avatar Dec 05 '23 07:12 jordanchenzb