torch.multiprocessing subprocess receives tensor with zeros rather than actual data
Context
torch.multiprocessing does not appear to send tensor data to spawned processes on my setup: CUDA tensors arrive in the subprocess with their data replaced by zeros.
- PyTorch version:
torch==1.11.0+cu113
torchaudio==0.11.0+cu113
torchvision==0.12.0+cu113
- Operating System and version: Windows 10, version 21H1
- CUDA version: 11.7
Your Environment
- Installed using source? [yes/no]: no
- Are you planning to deploy it using docker container? [yes/no]: no
- Is it a CPU or GPU environment?: GPU
- Which example are you using: mnist_hogwild
- Link to code or data to repro [if any]:
Expected Behavior
Insert a print at the start of train.train to check that the model parameters have been copied to the subprocess correctly:
print(f"Norm was: {model.fc1.weight.norm().item()}")
The print above should show some random non-zero number. When I run without CUDA, it does:
>python main.py
Norm was: 4.082266807556152
Norm was: 4.081115245819092
... [training begins]
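For context, the check is literally just the first line of the train function; roughly the following (the signature below is from my copy of examples/mnist_hogwild/train.py and may differ slightly):

def train(rank, args, model, device, dataset, dataloader_kwargs):
    # Added check: if the parent's parameters arrived intact in this
    # subprocess, this should print a non-zero random value.
    print(f"Norm was: {model.fc1.weight.norm().item()}")
    ...  # rest of the original training loop is unchanged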
Current Behavior
When I run with CUDA, the tensor is all zeros:
>python main.py --cuda
Norm was: 0.0
Norm was: 0.0
... [training begins]
Repro
I think this is not a problem with the example itself but with the underlying torch.multiprocessing, or with my installation. The issue seems to be that any CUDA tensor sent to a subprocess has its data replaced with zeros.
The steps to reproduce this in the mnist_hogwild example are above (they amount to "run it with CUDA on my machine").
As an even more minimal repro, this also fails for me:
import torch as th
import torch.multiprocessing as mp
if __name__ == "__main__":
    parameter = th.randn(1, device='cuda:0')
    print(parameter)  # in the parent, parameter is a 1-element tensor with a random value
    mp.set_start_method("spawn")
    p = mp.Process(target=print, args=(parameter,))  # the child's print shows a zero tensor instead
    p.start()
    p.join()
[Edited to simplify repro code]
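To narrow it down, here is a variant of the same repro that sends both a CPU copy and the CUDA tensor to the child (a quick sketch along the same lines, not carefully tested). Based on the behavior above I'd expect the CPU copy to arrive intact while the CUDA tensor shows up as zeros:

import torch as th
import torch.multiprocessing as mp

def report(cpu_t, cuda_t):
    # Print what the child process actually received.
    print("cpu tensor in child: ", cpu_t)
    print("cuda tensor in child:", cuda_t)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    cuda_t = th.randn(1, device="cuda:0")
    cpu_t = cuda_t.cpu()  # same values, but in host memory
    print("in parent:", cpu_t, cuda_t)  # both show the same random value
    p = mp.Process(target=report, args=(cpu_t, cuda_t))
    p.start()
    p.join()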
Hi @dfarhi, I'm not able to reproduce this with torch==1.12.1+cu102 on Ubuntu 22.04 LTS. Is it still reproducible on your side?
I am having similar problems with torch==2.0.0+cu117, using torch.multiprocessing.Pool.map_async.
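I don't have a minimal extract of my real code, but the shape of it is roughly this (the worker and tensor sizes are made up for illustration); the tensors the workers receive come back zeroed, same as in the original report:

import torch
import torch.multiprocessing as mp

def norm_of(t):
    # Worker just reports the norm of whatever tensor it received.
    return t.norm().item()

if __name__ == "__main__":
    mp.set_start_method("spawn")
    tensors = [torch.randn(4, device="cuda:0") for _ in range(8)]
    with mp.Pool(processes=2) as pool:
        result = pool.map_async(norm_of, tensors)
        print(result.get())  # expected: eight non-zero norms; in my case they come back as 0.0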
Has a fix or workaround been found?