
RuntimeError: CUDA error: invalid device ordinal CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Open lajihaonange opened this issue 1 year ago • 2 comments

I hit this error when running the command CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 --nnodes=1 main_diffusion.py --gpu_id 0123 --cfg_path configs/training/diffusion_ffhq512.yaml --save_dir myfolder. Could someone help me solve it?

lajihaonange avatar Jul 17 '23 01:07 lajihaonange

I have updated the code. Please try: CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nproc_per_node=4 --nnodes=1 main_diffusion.py --cfg_path configs/training/diffusion_ffhq512.yaml --save_dir yourfolder

I suggest you first train the model on a single GPU, and then move on to distributed training.

zsyOAOA avatar Jul 17 '23 06:07 zsyOAOA

Thank you for your timely reply. Training on a single GPU already works for me. I will try your new code right now.

lajihaonange avatar Jul 17 '23 11:07 lajihaonange
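
For anyone landing here with the same error: "invalid device ordinal" generally means a process asked for a GPU index that is not among the devices visible to it. The thread does not say what the code update changed; the only visible difference is that the suggested command drops --gpu_id 0123. The sketch below is therefore not DifFace's code, just one way a torchrun-launched script could validate its device index before touching CUDA. The helper names visible_device_count and pick_local_device are hypothetical; LOCAL_RANK is the per-process variable that torchrun sets.

```python
import os


def visible_device_count(env=None):
    """Count the GPUs listed in CUDA_VISIBLE_DEVICES (None if it is unset)."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([d for d in value.split(",") if d.strip()])


def pick_local_device(env=None, device_count=None):
    """Return the device index for this process, or raise if it is out of range.

    Hypothetical helper: under torchrun each worker gets LOCAL_RANK in
    0..nproc_per_node-1, and that value is used as the device index.
    """
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    if device_count is None:
        # Fall back to the environment when the caller has no CUDA runtime count.
        device_count = visible_device_count(env)
    if device_count is not None and local_rank >= device_count:
        raise ValueError(
            f"LOCAL_RANK={local_rank} but only {device_count} device(s) are visible"
        )
    return local_rank


# In a training script (assuming PyTorch is installed) this could be used as:
#   import torch
#   torch.cuda.set_device(pick_local_device(device_count=torch.cuda.device_count()))
```

Failing early with a readable message makes a mismatch between --nproc_per_node and the number of visible GPUs easier to spot than the asynchronous CUDA error in the title.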