
MultiGPU Support (DataParallel)

Open turtleman99 opened this issue 2 years ago • 1 comment

Hi Fangzhou,

Thank you for your excellent work. The codebase is well-organized and easy to follow.

When I tried to train on mini-imagenet using anywhere from 2 to 8 GPUs with the following command,

python train.py --config=configs/convnet4/mini-imagenet/5_way_1_shot/train_reproduce.yaml --gpu=0,1,2,3

Python keeps reporting the error shown below:

meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
Traceback (most recent call last):
  File "train.py", line 265, in <module>
    main(config)
  File "train.py", line 130, in main
    logits = model(x_shot, x_query, y_shot, inner_args, meta_train=True)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "**/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "**/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "**/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/PyTorch-MAML/models/maml.py", line 223, in forward
    updated_params = self._adapt(
  File "**/PyTorch-MAML/models/maml.py", line 185, in _adapt
    params, mom_buffer = self._inner_iter(
  File "**/PyTorch-MAML/models/maml.py", line 99, in _inner_iter
    grads = autograd.grad(loss, params.values(),
  File "**/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
    return Variable._execution_engine.run_backward(
ValueError: grad requires non-empty inputs.
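
In case it helps with debugging: my current guess (an assumption on my part, based on how torch.nn.parallel.replicate has behaved since roughly PyTorch 1.5) is that DataParallel replicas no longer expose their copied tensors through named_parameters(). replicate() attaches them as plain attributes and records them in _former_parameters, so the parameter dict collected for the inner loop comes back empty, and autograd.grad receives an empty input list. A minimal check along these lines should reproduce it:

import torch
import torch.nn as nn
from torch.nn.parallel import replicate

net = nn.Linear(4, 4).cuda()
replicas = replicate(net, [0, 1])  # needs at least 2 visible GPUs

print(len(dict(net.named_parameters())))          # 2 (weight, bias)
print(len(dict(replicas[0].named_parameters())))  # expected: 0, because the
# replica's tensors live in _former_parameters rather than _parameters, so
# autograd.grad(loss, params.values()) in the inner loop sees an empty list
# and raises "ValueError: grad requires non-empty inputs."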

However, the code only works when using a single GPU. Since n_episode=4, I would expect it to run on 2 or 4 GPUs, where each replica would receive a whole number of episodes.
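
If that is indeed the cause, a fallback along the lines of the sketch below might let the inner loop see the replica's tensors (collect_params is a hypothetical helper I wrote for illustration, not code from this repo, and the _former_parameters fallback is my assumption):

import collections

def collect_params(module, prefix=''):
    # Recursively gather (name, tensor) pairs; fall back to the
    # _former_parameters dict that DataParallel replicas use in place
    # of their (empty) _parameters dict.
    params = module._parameters or getattr(module, '_former_parameters', {})
    for name, p in params.items():
        if p is not None:
            yield (prefix + name, p)
    for child_name, child in module._modules.items():
        if child is not None:
            yield from collect_params(child, prefix + child_name + '.')

# e.g., in models/maml.py, instead of OrderedDict(self.named_parameters()):
# params = collections.OrderedDict(collect_params(self))

Alternatively, one process per GPU with DistributedDataParallel avoids replicate() entirely, which seems to be the usual recommendation for meta-learning code that calls autograd.grad inside forward.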

Framework Versions:

  • python: 3.8
  • pytorch: 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0

Our ultimate goal is to port this repo into our project, but we ran into the same errors there. Any hints or help would be highly appreciated. Thanks!

turtleman99 • Apr 27 '22 15:04

@turtleman99 I have the same problem, have you found a solution?

woreom • Jan 22 '24 23:01