deep-learning-for-image-processing

Error when running on multiple GPUs

Open BarryYHX opened this issue 3 years ago • 2 comments

System information

  • Have I written custom code: Yes
  • OS Platform(e.g., window10 or Linux Ubuntu 16.04): Linux
  • Python version: 3.8
  • Deep learning framework and version(e.g., Tensorflow2.1 or Pytorch1.3): pytorch1.7.1
  • Use GPU or not: Use
  • CUDA/cuDNN version(if you use GPU): CUDA11.7
  • The network you trained(e.g., Resnet34 network): faster_res50_rpn

Describe the current behavior

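Hello, I am running train_multi_GPU.py on the VG dataset. The dataset is set up to match the output of my_dataset.py and has been converted to tensors, but the step "global_features,loss_dict = model(images, targets)" always raises "RuntimeError: chunk expects at least a 1-dimensional tensor". I can't tell which input fails to meet the requirement. Is there a way to fix this?

My custom changes: I switched the training dataset to VG, commented out the COCO-related code, and removed the validation set. I also added some code after roi_head, which should not affect training of the base model before it. No other code was touched.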

Error info / logs

Traceback (most recent call last):
  File "train_multi_GPU.py", line 273, in <module>
    main(args)
  File "train_multi_GPU.py", line 151, in main
    mean_loss, lr = utils.train_one_epoch(model, optimizer, data_loader,
  File "/home/zzyyxx/Image_Catpion/faster_rcnn/train_utils/train_eval_utils.py", line 46, in train_one_epoch
    global_features,loss_dict = model(images, targets)
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 617, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 643, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
    res = scatter_map(inputs)
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 17, in scatter_map
    return list(map(list, zip(*map(scatter_map, obj))))
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 19, in scatter_map
    return list(map(type(obj), zip(*map(scatter_map, obj.items()))))
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 92, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 186, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: chunk expects at least a 1-dimensional tensor

Traceback (most recent call last):
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/zzyyxx/enter/envs/ZTorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/zzyyxx/enter/envs/ZTorch/bin/python', '-u', 'train_multi_GPU.py']' returned non-zero exit status 1.

BarryYHX, Nov 29 '22 09:11

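Did it run successfully on a single GPU? Please first verify that everything works without any problems on a single GPU before moving to multi-GPU training.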

WZMIAOMIAO, Dec 03 '22 05:12

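> Did it run successfully on a single GPU? Please first verify that everything works without any problems on a single GPU before moving to multi-GPU training.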

It runs fine with train_res50_fpn.py, but with --nproc_per_node=1 I get the same error.

BarryYHX, Dec 06 '22 06:12
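
One thing that may be worth checking, going only by the traceback above: the scatter_map frames recurse through a tuple, a list, a dict and a dict item before chunk fails, which suggests (but does not prove) that one of the values inside a targets dict is a 0-dimensional tensor, for example something built like torch.tensor(idx) instead of torch.tensor([idx]). That would also fit the error appearing with --nproc_per_node=1 but not in the single-GPU script, since the scatter step only runs inside the DistributedDataParallel wrapper. The sketch below is a hypothetical helper, not code from the repo: find_zero_dim and promote_zero_dim are made-up names, and anything with a callable .dim() is treated as a tensor.

```python
def find_zero_dim(obj, path="targets"):
    """Yield the path of every 0-dimensional tensor nested inside obj.

    Hypothetical debugging helper: any object with a callable .dim()
    (such as torch.Tensor) is treated as a tensor.
    """
    dim = getattr(obj, "dim", None)
    if callable(dim):
        if dim() == 0:
            yield path
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from find_zero_dim(value, f"{path}[{key!r}]")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            yield from find_zero_dim(value, f"{path}[{index}]")


def promote_zero_dim(target):
    """Return a copy of one target dict with 0-dim tensors reshaped to shape (1,)."""
    fixed = {}
    for key, value in target.items():
        dim = getattr(value, "dim", None)
        if callable(dim) and dim() == 0:
            # reshape(1) turns a scalar tensor into a 1-element, 1-D tensor
            value = value.reshape(1)
        fixed[key] = value
    return fixed
```

Calling print(list(find_zero_dim(targets))) right before model(images, targets) in train_one_epoch should show whether any such entry exists. If one does, building it as a 1-D tensor in the dataset's __getitem__ is probably cleaner than patching it afterwards with promote_zero_dim.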