
Argument #1 'input' doesn't have the same device as tensor for argument #2

Open JingweiZhang12 opened this issue 4 years ago • 2 comments

I tried to use apex and got the following error:

Traceback (most recent call last):
  File "train_autodeeplab.py", line 412, in <module>
    main()
  File "train_autodeeplab.py", line 405, in main
    trainer.training(epoch)
  File "train_autodeeplab.py", line 175, in training
    output = self.model(image)
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/apex/amp/_initialize.py", line 194, in new_fwd
    **applier(kwargs, input_caster))
  File "/home/zhangjw/AutoML/auto_deeplab.py", line 165, in forward
    temp = self.stem0 (x)
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zhangjw/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

When I don't use apex, the error disappears. Do you have any suggestions?

@NoamRosenberg Could you please share your versions of gcc, CUDA, and cuDNN, and the type of your GPU? I suspect the issue may be related to those, because the following warning appears at run time:

Warning: multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback. Original ImportError was: ModuleNotFoundError("No module named 'amp_C'")

Did you see this warning?
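The warning quoted in this post names the `amp_C` module, which apex reports as missing. A minimal sketch for checking whether Python can locate that module in the current environment is shown here. The helper name `apex_extension_status` is hypothetical, and the check only reports whether the modules can be found; it does not confirm that the missing extension is the cause of the device mismatch.

```python
import importlib.util


def apex_extension_status(modules=("apex", "amp_C")):
    """Report whether each named top-level module can be located.

    `amp_C` is the module named in the apex warning; the helper itself is
    a hypothetical diagnostic and is not part of apex or the training script.
    """
    status = {}
    for name in modules:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially initialised modules.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, found in apex_extension_status().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If `apex` is found but `amp_C` is missing, that would be consistent with the warning's own suggestion that apex was installed without `--cuda_ext --cpp_ext`.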

JingweiZhang12, Sep 05 '19 07:09

What opt_level are you using? And are you running on one GPU or on multiple GPUs? Currently only some of the opt_levels work with multiple GPUs. We issue a warning if you set a combination that is not currently supported.

iariav, Sep 05 '19 08:09

@iariav Thanks for your reply. I only run with the default opt_level='O0', on two GPUs. The hyperparameters are as follows:

CUDA_VISIBLE_DEVICES=8,9 python train_autodeeplab.py --batch-size 4 --dataset cityscapes --checkname Sep3 --alpha_epoch 20 --filter_multiplier 4 --resize 358 --crop_size 256
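When reading the error together with this command, it may help to remember that CUDA renumbers the GPUs listed in `CUDA_VISIBLE_DEVICES` starting from 0 inside the process, so "device 1" and "device 0" in the RuntimeError would correspond to the second and first entries of that list. A simplified sketch of the mapping follows; the function name `logical_device_map` is hypothetical, and the parsing ignores edge cases such as invalid ids.

```python
import os


def logical_device_map(visible=None):
    """Map logical CUDA indices (as seen by the process) to physical GPU ids.

    Simplified illustration: an unset or empty variable yields an empty map,
    which here just means that no explicit restriction was given.
    """
    if visible is None:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    ids = [part.strip() for part in visible.split(",") if part.strip()]
    return dict(enumerate(ids))


print(logical_device_map("8,9"))  # {0: '8', 1: '9'}
```

With the command above, the input tensor reported on device 1 would therefore sit on physical GPU 9, while the convolution weight on device 0 would sit on physical GPU 8.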

JingweiZhang12, Sep 09 '19 06:09