guided-diffusion

unable to train/sample using mpiexec on multiple GPUs

aksy1999 opened this issue

Thanks for providing the code implementation.

I am able to train and use the model on 1 GPU, but I am having issues when using multiple GPUs.

I am creating multiple processes using mpiexec as suggested in the repo (I tried mpiexec from both OpenMPI and MPICH and hit the same issue).

Issue: In both the sampling and training cases, multiple processes are created and the models load onto the GPUs, but I am not able to sample/train. I see no progress at all (it seems like a deadlock).

A) Below is an example of the command I am running for inference/sampling (as suggested in this repo, openai/guided-diffusion):

mpiexec -n 8 python classifier_sample.py --attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --image_size 256 --learn_sigma True --noise_schedule linear --num_channels 256 --num_head_channels 64 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --classifier_scale 1.0 --classifier_path "models/256x256_classifier.pt" --model_path "models/256x256_diffusion.pt" --batch_size 1 --num_samples 4 --timestep_respacing 250

Problem A: The program stops at line 93 of classifier_sample.py, i.e., all_images.extend([sample.cpu().numpy() for sample in gathered_samples]).
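For reference, a stripped-down test of just that collective, an all_gather of one CUDA tensor per rank, might help isolate whether the hang is in NCCL / process-group setup rather than in the diffusion code itself. This is my own sketch, not code from the repo (the single-node rendezvous settings are assumptions), launched the same way with mpiexec:

```python
# allgather_check.py -- standalone sketch, not code from the repo.
# Launch: mpiexec -n 8 python allgather_check.py
# If this also hangs, the deadlock is in the NCCL collective / process-group
# setup rather than in guided-diffusion itself.
import os

import torch
import torch.distributed as dist
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

# Single-node rendezvous (assumed values; adjust the port if it is taken).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

torch.cuda.set_device(rank % torch.cuda.device_count())
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# Mirror the pattern around line 93: every rank contributes one tensor and
# gathers the tensors from all ranks.
sample = torch.full((3, 64, 64), float(rank), device="cuda")
gathered = [torch.zeros_like(sample) for _ in range(world_size)]
dist.all_gather(gathered, sample)

print(f"rank {rank}: gathered means = {[g.mean().item() for g in gathered]}", flush=True)
dist.destroy_process_group()
```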

B) Below is an example of the command I am running for training (as suggested in the parent repo, openai/improved-diffusion):

mpiexec -n 8 python image_train.py --data_dir ./data_dir --image_size 256 --class_cond False --learn_sigma True --num_channels 256 --num_res_blocks 2 --num_head_channels 64 --attention_resolutions 32,16,8 --dropout 0.1 --diffusion_steps 1000 --noise_schedule linear --use_checkpoint True --use_scale_shift_norm True --resblock_updown True --use_fp16 True --use_new_attention_order True --lr 1e-4 --batch_size 32

Problem B: The program stops in the TrainLoop __init__ function, where DistributedDataParallel (DDP) is constructed, i.e., self.ddp_model = DDP(self.model, device_ids=[dist_util.dev()], output_device=dist_util.dev(), broadcast_buffers=False, bucket_cap_mb=128, find_unused_parameters=False).
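Similarly, a minimal stand-in for that step (again my own sketch, with assumed single-node rendezvous settings) just wraps a tiny model in DDP, which triggers the initial parameter broadcast across ranks, and runs one forward/backward pass:

```python
# ddp_check.py -- standalone sketch, not code from the repo.
# Launch: mpiexec -n 8 python ddp_check.py
# Constructing DDP broadcasts the parameters from rank 0 to all ranks,
# which appears to be the point where my training run stalls.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from mpi4py import MPI
from torch.nn.parallel import DistributedDataParallel as DDP

comm = MPI.COMM_WORLD
rank, world_size = comm.Get_rank(), comm.Get_size()

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # assumed single-node setup
os.environ.setdefault("MASTER_PORT", "29501")

device = rank % torch.cuda.device_count()
torch.cuda.set_device(device)
dist.init_process_group("nccl", rank=rank, world_size=world_size)

model = nn.Linear(256, 256).cuda()
ddp_model = DDP(model, device_ids=[device], output_device=device)  # parameter broadcast happens here
loss = ddp_model(torch.randn(8, 256, device="cuda")).sum()
loss.backward()                                                     # exercises the gradient all_reduce as well
print(f"rank {rank}: DDP construction and one step completed", flush=True)
dist.destroy_process_group()
```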

I have waited approximately 24 hours to see whether the code would eventually run, but it did not. I have also tried other ways of creating the processes, such as python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 and multiprocessing.spawn, but they did not work either.
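For completeness, a basic MPI + CUDA check that does not involve torch.distributed at all (my own sketch) could rule out launcher or GPU-visibility problems; each rank should print its hostname and its own device:

```python
# mpi_gpu_check.py -- standalone sketch; no torch.distributed involved.
# Launch: mpiexec -n 8 python mpi_gpu_check.py
import socket

import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_gpus = torch.cuda.device_count()
device = torch.device(f"cuda:{rank % n_gpus}") if n_gpus else torch.device("cpu")
x = torch.ones(1000, device=device).sum()   # force a small kernel on that device
comm.Barrier()                              # all ranks must reach this point
print(f"rank {rank} on {socket.gethostname()} -> {device}, sum = {x.item()}", flush=True)
```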

With this issue, I have two requests:

A) If possible, could you please provide the version details of all the dependencies, such as PyTorch, CUDA, cuDNN, Python, OpenMPI/MPICH, mpi4py, and so on? My problems may be due to a dependency version incompatibility.

I also built PyTorch from source with CUDA 11.2 and had the same issues.

B) Do you have any suggestions/insights for training? Did you see any such behavior? Could you please suggest a training strategy for an ablation study?

Below are the dependency versions I am currently using (the issue is reproducible with these versions):

conda 4.10.3
Python 3.9.7
PyTorch 1.9.1 (py3.9_cuda11.1_cudnn8.0.5_0)
cudatoolkit 11.1.74
mpich 3.4.2
mpi4py 3.1.1

I will be happy to provide any other details related to the dependencies I am using.

aksy1999 · Oct 06 '21

Hi, I had the same problem while running the training code. It seems that there is a deadlock with NCCL 2.7.8 (check here). Try running export NCCL_P2P_DISABLE=1 before launching; it worked for me.

Python 3.8.11
PyTorch 1.9.1
cudatoolkit 10.2.89
mpi4py 3.0.3
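In case it helps, this is roughly how one can check which NCCL version the PyTorch build ships and set the variable from Python instead of the shell. My understanding is that it only needs to be set before the first NCCL communicator is created, i.e. before init_process_group / the first collective; please double-check:

```python
# Sketch: check the NCCL version bundled with PyTorch and apply the workaround
# from Python (assumption: setting the variable before process-group setup is enough).
import os

import torch

print("NCCL version:", torch.cuda.nccl.version())   # e.g. 2708 or (2, 7, 8), depending on the PyTorch release

os.environ["NCCL_P2P_DISABLE"] = "1"                 # disable peer-to-peer transport
# ... then set up the process group as usual (dist_util.setup_dist() in this repo) ...
```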

guillaumejs2403 · Nov 17 '21

Thanks so much!

Kai-0515 · May 23 '22

Thanks for the suggestion, @guillaumejs2403. Have you faced this problem when resuming training on multiple GPUs, e.g., while loading the checkpoint, optimizer state, and so forth?

JiamingLiu-Jeremy · Jun 17 '22