improved-diffusion
Training does not work
Thanks for your great work! I am new to MPI and I ran into some NCCL errors when using your command to launch training. My environment is:
Ubuntu 20.04, 2 × RTX 3090, Python 3.7 + torch 1.10 + CUDA 11.3
My command is:
MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 3"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 4"
mpiexec -n 2 python scripts/image_train.py --data_dir /home/CelebA-HQ-img/ $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
And I got:
Traceback (most recent call last):
  File "scripts/image_train.py", line 84, in <module>
    main()
  File "scripts/image_train.py", line 56, in main
    lr_anneal_steps=args.lr_anneal_steps,
  File "/projects/guided-diffusion/guided_diffusion/train_util.py", line 68, in __init__
    self._load_and_sync_parameters()
  File "/projects/guided-diffusion/guided_diffusion/train_util.py", line 123, in _load_and_sync_parameters
    dist_util.sync_params(self.model.parameters())
  File "/projects/guided-diffusion/guided_diffusion/dist_util.py", line 101, in sync_params
    dist.broadcast(p, 0)
  File "/miniconda3/envs/tensorflow/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1187, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, unhandled cuda error, NCCL version 2.10.3
ncclUnhandledCudaError: Call to CUDA function failed.
However, I tried tweaking the code (mainly dist_util.py) and using torch.distributed.launch to start the training instead, and it seems to work.
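For anyone hitting the same thing, a minimal sketch of that kind of tweak (an assumption about the change, not the actual diff, which isn't posted here) is to replace the MPI-based setup in guided_diffusion/dist_util.py with one driven by the launcher's environment variables:

import os
import torch
import torch.distributed as dist

def setup_dist():
    # torch.distributed.launch --use_env (or torchrun) sets RANK, LOCAL_RANK,
    # WORLD_SIZE, MASTER_ADDR, and MASTER_PORT in each spawned process.
    if dist.is_initialized():
        return
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # pin this process to its own GPU
    dist.init_process_group(backend="nccl")

and then launch with something like:

python -m torch.distributed.launch --use_env --nproc_per_node=2 scripts/image_train.py --data_dir /home/CelebA-HQ-img/ $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS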
Sorry, I found the reason. I'm running this command on a single machine with multiple GPUs, and some of the GPUs are occupied by other users, so I need to assign free GPUs to this training. A slight modification to the code solved it.
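The actual modification isn't shown in the thread, but a minimal sketch of the idea (the free GPU ids below are placeholders) is to map each MPI rank onto an explicitly chosen free GPU before any CUDA context is created:

import os
from mpi4py import MPI

FREE_GPUS = [0, 1]  # placeholder: list the GPU ids that are actually free

def pin_to_free_gpu():
    rank = MPI.COMM_WORLD.Get_rank()
    # Must run before torch initializes CUDA in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(FREE_GPUS[rank % len(FREE_GPUS)])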
I have the same problem; can you give me some advice? I added os.environ["CUDA_VISIBLE_DEVICES"] = "0,1" to the code, but it doesn't work.
When launching with MPI, you need to specify the visible device for each process individually. My suggestion is to switch to launching with torch.distributed.
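To illustrate the point: if every process sees both GPUs, nothing forces rank 1 onto the second card, and NCCL typically fails when both ranks end up on the same device. Under MPI each process has to pin its own device, roughly like this (a sketch, not the repo's actual code):

import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)  # rank 0 -> GPU 0, rank 1 -> GPU 1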
Hello, can you tell me how to modify the code to run this command on a single machine with multiple GPUs?
@wileewang Hello, can you tell me how to modify the code to run this command on a single machine with multiple GPUs?
@pokameng Hello, have you solved this problem?
export NCCL_P2P_DISABLE=1 works
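Disabling peer-to-peer transfers makes NCCL stage communication through host memory instead of directly between GPUs, which can work around this unhandled CUDA error on machines where P2P is broken. Applied to the command at the top of the thread (same flags as above), that would be:

NCCL_P2P_DISABLE=1 mpiexec -n 2 python scripts/image_train.py --data_dir /home/CelebA-HQ-img/ $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS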