
GPUs

Open xiewende opened this issue 3 years ago • 4 comments

I used three GPUs to train but got an error. Could you give me some suggestions?

xiewende avatar Nov 10 '21 02:11 xiewende

Have you solved this error? I also ran into an error when training on multiple GPUs. It is as follows: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)

lixiang007666 avatar Dec 17 '21 10:12 lixiang007666

Have you solved this error? I also ran into an error when training on multiple GPUs. It is as follows: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)

Did you encounter this issue while loading the pretrained weights? If so, most likely you should specify map_location in your torch.load() somewhere.
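
For example, something along these lines (just a sketch; the function name, checkpoint path argument, rank variable, and state-dict key are placeholders, not code from this repo):

```python
import torch

def load_weights(model, ckpt_path, rank):
    # Map every stored tensor onto the GPU owned by this process instead of
    # the device it was saved from (often cuda:0).
    checkpoint = torch.load(ckpt_path, map_location=f"cuda:{rank}")
    # Adjust the key below to however the checkpoint was actually saved.
    model.load_state_dict(checkpoint["state_dict"])
    return model
```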

YellowPig-zp avatar Feb 11 '22 10:02 YellowPig-zp

The error message is this:

(base) lixiang@vs008:~/CWT-for-FSS$ sh scripts/train.sh pascal 1 [0,1] 50 1
  0%|                                                                                       | 0/5953 [00:00<?, ?it/s]==> Running process rank 0.
FB_param_noise: 0
adapt_iter: 200
arch: resnet
augmentations: ['hor_flip', 'vert_flip', 'resize']
backbone_dim: 2048
batch_size: 2
batch_size_val: 2
bins: [1, 2, 3, 6]
bottleneck_dim: 512
ckpt_path: checkpoints/
ckpt_used: best
cls_lr: 0.1
data_root: pascal/
debug: False
distributed: True
dropout: 0.1
episodic: True
epochs: 20
gamma: 0.1
gpus: [0, 1]
heads: 4
image_size: 473
iter_per_epoch: 6000
layers: 50
log_freq: 50
lr_stepsize: 30
m_scale: False
main_optim: SGD
manual_seed: 2021
mean: [0.485, 0.456, 0.406]
milestones: [40, 70]
mixup: False
model_dir: model_ckpt
momentum: 0.9
n_runs: 1
nesterov: True
norm_feat: True
num_classes_tr: 2
num_classes_val: 5
padding_label: 255
port: 60989
pretrained: False
random_shot: False
resume_weights: /pretrained_models/
rot_max: 10
rot_min: -10
save_models: True
save_oracle: False
scale_lr: 1.0
scale_max: 2.0
scale_min: 0.5
scheduler: cosine
shot: 1
smoothing: True
std: [0.229, 0.224, 0.225]
test_name: default
test_num: 1000
test_split: default
train_list: lists/pascal/train.txt
train_name: pascal
train_split: 1
trans_lr: 0.001
use_split_coco: False
val_list: lists/pascal/val.txt
weight_decay: 0.0001
workers: 2
=> no weight found at '/pretrained_models/'
Processing data for [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
  0%|                                                                                       | 0/5953 [00:00<?, ?it/s]==> Running process rank 1.
[... rank 1 prints the same configuration as rank 0 above ...]
=> no weight found at '/pretrained_models/'
Processing data for [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
100%|██████████████████████████████████████████████████████████████████████████| 5953/5953 [00:05<00:00, 1021.86it/s]
100%|██████████████████████████████████████████████████████████████████████████| 5953/5953 [00:05<00:00, 1013.71it/s]
  0%|                                                                                       | 0/1449 [00:00<?, ?it/s]INFO: pascal -> pascal
INFO: 1 -> 1
>> Start Filtering classes 
>> Removed classes = [] 
>> Kept classes = ['bus', 'cat', 'car', 'chair', 'cow'] 
Processing data for [6, 7, 8, 9, 10]
  0%|                                                                                       | 0/1449 [00:00<?, ?it/s]INFO: pascal -> pascal
INFO: 1 -> 1
>> Start Filtering classes 
>> Removed classes = [] 
>> Kept classes = ['bus', 'cat', 'car', 'chair', 'cow'] 
Processing data for [6, 7, 8, 9, 10]
100%|███████████████████████████████████████████████████████████████████████████| 1449/1449 [00:03<00:00, 382.78it/s]
100%|███████████████████████████████████████████████████████████████████████████| 1449/1449 [00:04<00:00, 351.74it/s]
Traceback (most recent call last):
  File "/home/lixiang/anaconda3/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/lixiang/anaconda3/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/lixiang/CWT-for-FSS/src/train.py", line 360, in <module>
    mp.spawn(main_worker, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/lixiang/CWT-for-FSS/src/train.py", line 134, in main_worker
    _, _ = do_epoch(
  File "/home/lixiang/CWT-for-FSS/src/train.py", line 266, in do_epoch
    output_support = binary_cls(f_s)
  File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/lixiang/anaconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 442, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)

@YellowPig-zp

lixiang007666 avatar Feb 14 '22 00:02 lixiang007666

Have you solved this error? I also ran into an error when training on multiple GPUs. It is as follows: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument weight in method wrapper__cudnn_convolution)

Did you encounter this issue while loading the pretrained weights? If so, most likely you should specify map_location in your torch.load() somewhere.

I tried adding the map_location parameter, but it didn't solve the problem.

lixiang007666 avatar Feb 14 '22 00:02 lixiang007666

Sorry, the current code supports only a single GPU. A multi-GPU version would be a nice addition; feel free to open a pull request. Thanks.
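
For anyone who wants to attempt that, the usual per-rank device-placement pattern with mp.spawn looks roughly like the sketch below. This is illustrative only and not code from this repository: main_worker and binary_cls are borrowed from the traceback above, while the layer/tensor shapes and argument handling are assumptions.

```python
import torch
import torch.multiprocessing as mp

def main_worker(rank, world_size, args):
    # Bind this process to its own GPU so new CUDA tensors and bare .cuda()
    # calls default to cuda:{rank} instead of cuda:0.
    torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}")

    # Every module used inside the worker must live on this process's device,
    # including auxiliary heads such as the episodic binary classifier.
    # (Conv2d shape here is a placeholder, not the repo's actual layer.)
    binary_cls = torch.nn.Conv2d(512, 2, kernel_size=1).to(device)

    # Inputs must be moved to the same per-rank device so the convolution
    # weights and activations never end up on different GPUs.
    f_s = torch.randn(2, 512, 60, 60, device=device)
    output_support = binary_cls(f_s)  # no cross-device mismatch

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(main_worker, args=(world_size, None), nprocs=world_size, join=True)
```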

zhiheLu avatar Mar 30 '24 09:03 zhiheLu