vision
Torchvision transform hangs when using python multiprocessing and model inference
🐛 Describe the bug
Torchvision transforms cause code to hang when Python multiprocessing is combined with a model used for inference.
In particular, I'm seeing hangs when a CLIP model is created in the parent process and entirely unrelated torch code runs in a child process. An issue was filed against the CLIP repository here (https://github.com/openai/CLIP/issues/130), but I figured this should also be flagged on torchvision because I don't think the issue is specific to their model. The code hangs on img.permute((2, 0, 1)).contiguous() in transforms/functional.py, which is in turn called by the ToTensor transform at F.to_tensor(pic).
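To confirm where a child process is stuck without adding print statements, one option is the standard-library faulthandler module. The sketch below is not part of the original report: run_with_watchdog is a hypothetical helper name, and the 10-second default timeout is an arbitrary choice. faulthandler only reports Python frames, so a hang inside native code shows up as the Python line that called into it.

```python
import faulthandler
import sys


def run_with_watchdog(fn, timeout_s=10.0):
    """Run fn(); if it is still running after timeout_s seconds, dump every
    thread's Python stack to stderr so the hanging line is visible.

    Hypothetical helper for illustration, not part of torch or torchvision.
    """
    # sys.__stderr__ is used because faulthandler needs a real file descriptor.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=sys.__stderr__)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

For example, mp.Process(target=run_with_watchdog, args=(test,), daemon=True) would wrap the test function from the sample in this report.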
Minimal code sample in which I split the to_tensor transform and remove any unneeded parts:
import torch
import clip
from PIL import Image
import multiprocessing as mp

# Creating the model in the parent process is what triggers the hang in the child.
model = clip.model.CLIP(512, 224, 12, 768, 32, 77, 49408, 512, 8, 12)

def test():
    # The steps of to_tensor, split apart to find the one that hangs.
    print("GETTING IMAGE")
    im = Image.open("CLIP.png")
    print("CONVERTING")
    im = im.convert('RGB')
    print("MAKING TENSOR")
    img = torch.ByteTensor(torch.ByteStorage.from_buffer(im.tobytes()))
    print("VIEW")
    img = img.view(im.size[1], im.size[0], len(im.getbands()))
    print("PERMUTING")
    img = img.permute((2, 0, 1))
    print("CONTIGUOUS")
    img = img.contiguous()  # the child process hangs here
    print("DIV")
    img = img.float().div(255)
    print("UNSQUEEZE")
    img = img.unsqueeze(0)
    return img

p = mp.Process(target=test, daemon=True)
p.start()
p.join()
This code hangs on the img.contiguous() call, but only if the model is initialized at the top. If the model initialization is commented out, it works as expected. Note also that the function run in the child process does not even use the model.
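A possible workaround, assuming the hang comes from forking a child after the parent process has already started torch's worker threads (this thread does not confirm that cause), is to start the child with the "spawn" method so it begins from a fresh interpreter. The sketch below is illustrative only: make_spawn_process and load_image_tensor are hypothetical names, and the worker builds the tensor through NumPy instead of torch.ByteStorage.

```python
import multiprocessing as mp


def load_image_tensor(path):
    """Worker: convert an image file to a 1 x C x H x W float tensor.

    torch is imported inside the worker so that the spawned child
    initializes it on its own (assumes torch, numpy and Pillow are installed).
    """
    import numpy as np
    import torch
    from PIL import Image

    im = Image.open(path).convert("RGB")
    img = torch.from_numpy(np.array(im))        # H x W x C, uint8
    img = img.permute((2, 0, 1)).contiguous()   # C x H x W
    img = img.float().div(255).unsqueeze(0)     # 1 x C x H x W, values in [0, 1]
    print("tensor shape:", tuple(img.shape))


def make_spawn_process(target, *args):
    """Build (but do not start) a daemon process that uses the 'spawn'
    start method instead of the Linux default 'fork'."""
    ctx = mp.get_context("spawn")
    return ctx.Process(target=target, args=args, daemon=True)
```

With "spawn", the child re-imports the main module, so both the model creation and the process launch would need to sit under an if __name__ == "__main__": guard, for example p = make_spawn_process(load_image_tensor, "CLIP.png") followed by p.start() and p.join(). Otherwise the model would be built again in the child.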
Versions
Collecting environment information...
/home/amol/code/soot/debugging/clip_tests/env/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31
Python version: 3.8.10 (default, Jun 2 2021, 10:49:15) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.11.0-34-generic-x86_64-with-glibc2.29
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.0
[pip3] torch==1.7.1
[pip3] torchvision==0.8.2
[conda] Could not collect
cc @vfdev-5 @datumbox
I'm dealing with the same problem 2+ years later
same here... any updates on this?
I am currently running into the same problem after building a new environment. torch.tensor() seems to be the culprit. Everything works perfectly in the old environment. Any updates? Can anyone help?