torch.multiprocessing subprocess receives tensor with zeros rather than actual data

Open dfarhi opened this issue 3 years ago • 2 comments

Your issue may already be reported! Please search on the issue tracker before creating one.

Context

th.multiprocessing does not appear to send tensor data to spawned processes on my setup.

  • PyTorch version:
torch==1.11.0+cu113
torchaudio==0.11.0+cu113
torchvision==0.12.0+cu113
  • Operating System and version: Windows 10 version 21H1
  • CUDA 11.7

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: mnist_hogwild
  • Link to code or data to repro [if any]:

Expected Behavior

Insert a print at the start of train.train to check that the parameters have been copied to the subprocess correctly:

    print(f"Norm was: {model.fc1.weight.norm().item()}")

The above print should show some random nonzero value. When I run without CUDA, it does:

>python main.py
Norm was: 4.082266807556152
Norm was: 4.081115245819092
... [training begins]
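
For reference, here is a standalone sketch of the same check, using a hypothetical stand-in model (TinyNet and its fc1 shape are made up here; the hogwild example shares its real model with share_memory() before spawning workers):

import torch.multiprocessing as mp
import torch.nn as nn

class TinyNet(nn.Module):
    # Hypothetical stand-in for the example's model; only fc1 matters for the check.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 8)

def check_norm(model):
    # Runs in the worker process: a nonzero norm means the weights arrived intact.
    print(f"Norm was: {model.fc1.weight.norm().item()}")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    model = TinyNet()
    # model = model.to('cuda:0')  # uncomment to mirror the --cuda case
    model.share_memory()
    p = mp.Process(target=check_norm, args=(model,))
    p.start()
    p.join()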

Current Behavior

When I run with CUDA, the tensor is all zeros:

>python main.py --cuda
Norm was: 0.0
Norm was: 0.0
... [training begins]

Repro

I think this is not a problem with the example but a problem with the underlying torch.multiprocessing, or with my installation. The issue seems to be that any tensors sent to a subprocess have their data replaced with zeros.

The steps above reproduce the issue in the mnist_hogwild example (the steps are just "run it with CUDA on my machine").

As an even more minimal repro, this also fails for me:

import torch as th
import torch.multiprocessing as mp

if __name__ == "__main__":
    parameter = th.randn(1, device='cuda:0')

    print(parameter)  # in the parent, parameter is a one-element tensor with a random value
    mp.set_start_method("spawn")

    p = mp.Process(target=print, args=(parameter,))  # in the subprocess, parameter prints as a zero tensor
    p.start()
    p.join()

[Edited to simplify repro code]
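
For anyone hitting the same thing, here is a workaround sketch that sidesteps sharing the CUDA tensor itself by passing a CPU tensor and moving it to the GPU inside the child process (worker is a made-up name; this is an illustration, not a confirmed fix for the underlying issue):

import torch as th
import torch.multiprocessing as mp

def worker(cpu_tensor):
    # Move to the GPU inside the child process instead of sending a CUDA tensor across.
    print(cpu_tensor.to('cuda:0'))

if __name__ == "__main__":
    mp.set_start_method("spawn")
    parameter = th.randn(1)   # created on the CPU
    print(parameter)          # prints the random value in the parent

    p = mp.Process(target=worker, args=(parameter,))
    p.start()
    p.join()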

dfarhi avatar Jun 16 '22 17:06 dfarhi

Hi @dfarhi, I'm not able to reproduce it with torch==1.12.1+cu102 on Ubuntu 22.04 LTS. Is it still reproducible on your side?

hudeven avatar Aug 09 '22 22:08 hudeven

I am having similar problems with torch==2.0.0+cu117, using torch.multiprocessing.Pool.map_async.

Has a fix or workaround been found?

iRRe33-smk avatar Apr 26 '23 11:04 iRRe33-smk