MedSegDiff

How to train on multi GPU?

kjRainy opened this issue 1 year ago · 5 comments

When I use `--multi_gpu 0,1,2`, I get an error: `RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!`

How should I change the code? Thanks!

kjRainy · Apr 21 '23 09:04

Can you tell me on which line the error is reported?

WuJunde · Apr 21 '23 09:04

When I run:

```
python scripts/segmentation_train.py --data_name PROMISE12 --image_size 256 --num_channels 128 --class_cond False --num_res_blocks 2 --num_heads 1 --learn_sigma True --use_scale_shift_norm False --attention_resolutions 16 --diffusion_steps 1000 --noise_schedule linear --rescale_learned_sigmas False --rescale_timesteps False --lr 1e-4 --batch_size 8 --multi_gpu 0,1,2
```

it fails with:

```
Traceback (most recent call last):
  File "scripts/segmentation_train.py", line 113, in <module>
    main()
  File "scripts/segmentation_train.py", line 82, in main
    lr_anneal_steps=args.lr_anneal_steps,
  File "./guided_diffusion/train_util.py", line 186, in run_loop
    self.run_step(batch, cond)
  File "./guided_diffusion/train_util.py", line 207, in run_step
    sample = self.forward_backward(batch, cond)
  File "./guided_diffusion/train_util.py", line 238, in forward_backward
    losses1 = compute_losses()
  File "./guided_diffusion/gaussian_diffusion.py", line 1007, in training_losses_segmentation
    clip_denoised=False,
  File "./guided_diffusion/gaussian_diffusion.py", line 941, in _vb_terms_bpd
    model, x_t, t, clip_denoised=clip_denoised, model_kwargs=model_kwargs
  File "./guided_diffusion/respace.py", line 90, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "./guided_diffusion/gaussian_diffusion.py", line 287, in p_mean_variance
    model_log_variance = frac * max_log + (1 - frac) * min_log
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
```
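A common cause of this kind of error (an assumption on my part, not confirmed in this thread) is that a precomputed schedule tensor, such as `min_log`/`max_log` in `p_mean_variance`, stays on one GPU while the model output `frac` lives on another replica's device. The usual fix is to move the schedule tensors onto the device of the tensor they are combined with. A minimal sketch (names mirror the traceback, but this is not MedSegDiff's actual code):

```python
import torch

def interp_log_variance(frac, min_log, max_log):
    """Interpolate log-variance as in the failing line, but move the
    schedule tensors onto frac's device first so all operands agree."""
    min_log = min_log.to(frac.device)
    max_log = max_log.to(frac.device)
    return frac * max_log + (1 - frac) * min_log

# Example: with frac == 0.5 everywhere, the result is the midpoint
# of min_log and max_log regardless of which devices they started on.
out = interp_log_variance(
    torch.full((4,), 0.5),
    torch.zeros(4),
    torch.ones(4),
)
```

On a multi-GPU run, `frac` would sit on the replica's device (e.g. `cuda:1`) while the buffers start on `cuda:0`; the explicit `.to(frac.device)` calls avoid the mismatch.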

kjRainy · Apr 21 '23 10:04

I get an error when running on multiple GPUs; looking for a solution.

```
python scripts/segmentation_train.py --data_name NC2016 --data_dir "/PublicFile/xp_data/NC2016/" --out_dir "./results/NC2016/trainv1" --image_size 256 --num_channels 128 --class_cond False --num_res_blocks 2 --num_heads 1 --learn_sigma True --use_scale_shift_norm False --attention_resolutions 16 --diffusion_steps 1000 --noise_schedule linear --rescale_learned_sigmas False --rescale_timesteps False --lr 1e-4 --batch_size 8 --multi_gpu 0,1,2
```

```
training...
Traceback (most recent call last):
  File "/home/xp/diffusion/MedSegDiff/scripts/segmentation_train.py", line 117, in <module>
    main()
  File "/home/xp/diffusion/MedSegDiff/scripts/segmentation_train.py", line 69, in main
    TrainLoop(
  File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/train_util.py", line 186, in run_loop
    self.run_step(batch, cond)
  File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/train_util.py", line 207, in run_step
    sample = self.forward_backward(batch, cond)
  File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/train_util.py", line 238, in forward_backward
    losses1 = compute_losses()
  File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/gaussian_diffusion.py", line 1003, in training_losses_segmentation
    model_output, cal = model(x_t, self._scale_timesteps(t), **model_kwargs)
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 157, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2
```

xupinggl · May 16 '23 12:05

Some part of your module is on a different GPU. Do you get the same error when running on the example dataset? If the example cases run without problems, then the issue is in your data loading process.
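For context, the error in the second traceback comes from `DataParallel`, which requires every parameter and buffer of the wrapped module to already live on `device_ids[0]` before `forward()` is called. A minimal sketch of the expected setup (an illustration, not MedSegDiff's actual training code; it falls back to CPU when no GPU is present):

```python
import torch
import torch.nn as nn

# A stand-in model; in MedSegDiff this would be the diffusion UNet.
model = nn.Linear(8, 2)

if torch.cuda.is_available():
    device_ids = [0, 1, 2]  # mirrors --multi_gpu 0,1,2
    # Move the whole module to device_ids[0] BEFORE wrapping it.
    # If any parameter or buffer is left on another device (e.g. cuda:2),
    # DataParallel raises exactly the RuntimeError shown above.
    model = model.to(f"cuda:{device_ids[0]}")
    model = nn.DataParallel(model, device_ids=device_ids)

# Inputs also go to device_ids[0]; DataParallel scatters them itself.
x = torch.randn(4, 8)
if torch.cuda.is_available():
    x = x.to("cuda:0")
out = model(x)
```

The same rule applies to tensors created inside the training loop (noise, timesteps, conditioning): they should be placed on the batch's device rather than a hard-coded `cuda:0`.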

WuJunde · May 25 '23 13:05

> I get an error when running on multiple GPUs; looking for a solution. [quotes the same command and traceback as the comment above]

I ran into exactly the same error. Did you manage to solve it?

DBook111 · Jul 04 '23 14:07