MedSegDiff
How to train on multiple GPUs?
When I use --multi_gpu 0,1,2, I get an error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
How should I change the code to fix this? Thanks!
Can you tell me which line the error is reported on?
When I run: python scripts/segmentation_train.py --data_name PROMISE12 --image_size 256 --num_channels 128 --class_cond False --num_res_blocks 2 --num_heads 1 --learn_sigma True --use_scale_shift_norm False --attention_resolutions 16 --diffusion_steps 1000 --noise_schedule linear --rescale_learned_sigmas False --rescale_timesteps False --lr 1e-4 --batch_size 8 --multi_gpu 0,1,2
The following error appears:
Traceback (most recent call last):
File "scripts/segmentation_train.py", line 113, in
I get an error when running with multiple GPUs; how can I fix it?
python scripts/segmentation_train.py --data_name NC2016 --data_dir "/PublicFile/xp_data/NC2016/" --out_dir "./results/NC2016/trainv1" --image_size 256 --num_channels 128 --class_cond False --num_res_blocks 2 --num_heads 1 --learn_sigma True --use_scale_shift_norm False --attention_resolutions 16 --diffusion_steps 1000 --noise_schedule linear --rescale_learned_sigmas False --rescale_timesteps False --lr 1e-4 --batch_size 8 --multi_gpu 0,1,2
training...
Traceback (most recent call last):
File "/home/xp/diffusion/MedSegDiff/scripts/segmentation_train.py", line 117, in
Some part of your module is on different GPUs. Did you get the same error when running on the example dataset? If the example cases run without problems, then the issue is in your data loading process.
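If the example dataset does train cleanly, one thing worth checking in the data path is that every tensor handed to the model has been moved onto that rank's device. A minimal sketch of that check, with illustrative names (run_step, batch, and cond are assumptions here, not the repository's exact train_util.py code):

import torch

def run_step(model, batch, cond):
    # Move the batch and any tensor-valued conditioning onto the same
    # device as the (DDP-wrapped) model before the forward pass.
    device = next(model.parameters()).device
    batch = batch.to(device)
    cond = {k: v.to(device) if torch.is_tensor(v) else v for k, v in cond.items()}
    return model(batch, **cond)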
I get an error when running with multiple GPUs; how can I fix it?
python scripts/segmentation_train.py --data_name NC2016 --data_dir "/PublicFile/xp_data/NC2016/" --out_dir "./results/NC2016/trainv1" --image_size 256 --num_channels 128 --class_cond False --num_res_blocks 2 --num_heads 1 --learn_sigma True --use_scale_shift_norm False --attention_resolutions 16 --diffusion_steps 1000 --noise_schedule linear --rescale_learned_sigmas False --rescale_timesteps False --lr 1e-4 --batch_size 8 --multi_gpu 0,1,2
training...
Traceback (most recent call last):
  File "/home/xp/diffusion/MedSegDiff/scripts/segmentation_train.py", line 117, in <module>
    main()
  File "/home/xp/diffusion/MedSegDiff/scripts/segmentation_train.py", line 69, in main
    TrainLoop(
  File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/train_util.py", line 186, in run_loop
    self.run_step(batch, cond)
  File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/train_util.py", line 207, in run_step
    sample = self.forward_backward(batch, cond)
  File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/train_util.py", line 238, in forward_backward
    losses1 = compute_losses()
  File "/home/xp/diffusion/MedSegDiff/./guided_diffusion/gaussian_diffusion.py", line 1003, in training_losses_segmentation
    model_output, cal = model(x_t, self._scale_timesteps(t), **model_kwargs)
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xp/.conda/envs/pytorch_xp39/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 157, in forward
    raise RuntimeError("module must have its parameters and buffers "
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2
I ran into exactly the same error. Did you manage to solve it?
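For anyone hitting this later: the traceback above goes from DistributedDataParallel into torch/nn/parallel/data_parallel.py, which suggests the model is wrapped in nn.DataParallel and then wrapped again in DDP, so each DDP rank expects the parameters on its own device_ids[0]. A minimal sketch of the usual one-process-per-GPU DDP setup that avoids that nesting (setup_ddp and the torchrun launch are assumptions for illustration, not the repository's code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model):
    # One process per GPU (e.g. launched with torchrun --nproc_per_node=3):
    # pin the raw model to this rank's device and wrap it with DDP only,
    # without an inner nn.DataParallel.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    return DDP(model, device_ids=[local_rank])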