
torch.multiprocessing subprocess receives tensor with zeros rather than actual data

Open dfarhi opened this issue 3 years ago • 2 comments

Your issue may already be reported! Please search on the issue tracker before creating one.

Context

torch.multiprocessing does not appear to transfer tensor data to spawned processes on my setup.

  • PyTorch version:
torch==1.11.0+cu113
torchaudio==0.11.0+cu113
torchvision==0.12.0+cu113
  • Operating System and version: Windows 10 version 21H1
  • CUDA 11.7

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: mnist_hogwild
  • Link to code or data to repro [if any]:

Expected Behavior

Insert a print at the start of train.train to check that the model parameters have been copied to the subprocess correctly:

    print(f"Norm was: {model.fc1.weight.norm().item()}")

The above print should show some random nonzero number. When I run without CUDA, it does:

>python main.py
Norm was: 4.082266807556152
Norm was: 4.081115245819092
... [training begins]

Current Behavior

When I run with CUDA, the tensor is all zeros:

>python main.py --cuda
Norm was: 0.0
Norm was: 0.0
... [training begins]

Repro

I think this is not a problem with the example but rather with the underlying torch.multiprocessing, or with my installation. The issue seems to be that any tensor sent to a subprocess has its data replaced with zeros.

The steps above reproduce the issue in the mnist_hogwild example (they amount to "run it with CUDA on my device").

As an even more minimal repro, this also fails for me:

import torch as th
import torch.multiprocessing as mp

if __name__ == "__main__":
    parameter = th.randn(1, device='cuda:0')

    print(parameter)  # in the parent, parameter is a 1-element tensor holding a random value
    mp.set_start_method("spawn")

    p = mp.Process(target=print, args=(parameter,))  # the subprocess prints a 1-element tensor of zeros
    p.start()
    p.join()

[Edited to simplify repro code]
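
For comparison, here is a minimal CPU-only control case, consistent with the non-CUDA runs above (a sketch, assuming the same spawn start method; only the device differs from the CUDA repro):

import torch as th
import torch.multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")

    cpu_parameter = th.randn(1)  # same repro, but the tensor lives on the CPU

    p = mp.Process(target=print, args=(cpu_parameter,))
    p.start()
    p.join()  # the subprocess prints the original random value, matching the non-CUDA runs above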

dfarhi avatar Jun 16 '22 17:06 dfarhi

Hi @dfarhi , I'm not able to reproduce it with torch==1.12.1+cu102 in Ubuntu 22.04 LTS. Is it still reproducible on your side?

hudeven avatar Aug 09 '22 22:08 hudeven

I am having similar problems with torch==2.0.0+cu117

using torch.multiprocessing.Pool.map_async
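For reference, a minimal sketch of the kind of pattern I mean (names and shapes are illustrative, not my actual code):

import torch
import torch.multiprocessing as mp

def norm_of(t):
    # In the worker, the received CUDA tensor arrives as all zeros on my setup.
    return t.norm().item()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    tensors = [torch.randn(4, device="cuda:0") for _ in range(8)]
    with mp.Pool(processes=2) as pool:
        result = pool.map_async(norm_of, tensors)
        # On my setup this prints 0.0 for every tensor instead of the real norms.
        print(result.get())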

Has a fix or workaround been found?

iRRe33-smk avatar Apr 26 '23 11:04 iRRe33-smk