
torch.multiprocessing subprocess receives tensor with zeros rather than actual data

Open dfarhi opened this issue 3 years ago • 2 comments

Your issue may already be reported! Please search on the issue tracker before creating one.

Context

torch.multiprocessing does not appear to transfer tensor data to spawned processes on my setup.

  • PyTorch version:
torch==1.11.0+cu113
torchaudio==0.11.0+cu113
torchvision==0.12.0+cu113
  • Operating System and version: Windows 10 version 21H1
  • CUDA 11.7

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: mnist_hogwild
  • Link to code or data to repro [if any]:

Expected Behavior

Insert a print at the start of train.train to check that the model parameters have been copied to the subprocess correctly:

    print(f"Norm was: {model.fc1.weight.norm().item()}")

The above print should show some random nonzero number. When I run without CUDA, it does:

>python main.py
Norm was: 4.082266807556152
Norm was: 4.081115245819092
... [training begins]

Current Behavior

When I run with CUDA, the tensor is all zeros:

>python main.py --cuda
Norm was: 0.0
Norm was: 0.0
... [training begins]

Repro

I think this is not a problem with the example but rather with the underlying torch.multiprocessing, or with my installation. The issue seems to be that any tensor sent to a subprocess has its data replaced with zeros.

The steps above reproduce the issue in the mnist_hogwild example (they amount to "run it with CUDA on my device").

As an even more minimal repro, this also fails for me:

import torch as th
import torch.multiprocessing as mp

if __name__ == "__main__":
    parameter = th.randn(1, device='cuda:0')

    print(parameter)  # in the parent, parameter is a 1-element tensor holding a random value
    mp.set_start_method("spawn")

    p = mp.Process(target=print, args=(parameter,))  # the subprocess prints a 1-element tensor of zeros
    p.start()
    p.join()

[Edited to simplify repro code]
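
For comparison, here is a minimal CPU-only control case, consistent with the non-CUDA runs above (a sketch, assuming the same spawn start method; only the device differs from the CUDA repro):

import torch as th
import torch.multiprocessing as mp

if __name__ == "__main__":
    mp.set_start_method("spawn")

    cpu_parameter = th.randn(1)  # same repro, but the tensor lives on the CPU

    p = mp.Process(target=print, args=(cpu_parameter,))
    p.start()
    p.join()  # the subprocess prints the original random value, matching the non-CUDA runs above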

dfarhi avatar Jun 16 '22 17:06 dfarhi

Hi @dfarhi , I'm not able to reproduce it with torch==1.12.1+cu102 in Ubuntu 22.04 LTS. Is it still reproducible on your side?

hudeven avatar Aug 09 '22 22:08 hudeven

I am having similar problems with torch==2.0.0+cu117

using torch.multiprocessing.Pool.map_async
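For reference, a minimal sketch of the kind of pattern I mean (names and shapes are illustrative, not my actual code):

import torch
import torch.multiprocessing as mp

def norm_of(t):
    # In the worker, the received CUDA tensor arrives as all zeros on my setup.
    return t.norm().item()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    tensors = [torch.randn(4, device="cuda:0") for _ in range(8)]
    with mp.Pool(processes=2) as pool:
        result = pool.map_async(norm_of, tensors)
        # On my setup this prints 0.0 for every tensor instead of the real norms.
        print(result.get())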

Has a fix or workaround been found?

iRRe33-smk avatar Apr 26 '23 11:04 iRRe33-smk