Issue with pycuda and multiprocessing
I created a simple gist that exposes an error for me on some machines but not others. It initializes the driver, then spawns a child process and initializes the driver there as well. https://gist.github.com/wconstab/f06362277a6235aa87bdc4235bfde731
Running py.test test_pycuda_multi.py on some machines causes this error in the child process:
LogicError: cuInit failed: initialization error
There are differences in [OS, whether Docker is used, Python 2 vs 3, Nvidia driver version] between the machines where this gist works and where it crashes, so I won't attempt to attribute the failure to one specific configuration.
It would be helpful to know whether this should work or whether the behavior is expected to be undefined.
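What the gist exercises can be sketched with the standard library alone, no GPU required. Under the fork start method (the default on Linux), the child inherits the parent's process state, which is exactly the situation the CUDA driver objects to after cuInit. The `_driver_initialized` flag below is a hypothetical stand-in for the driver state, not pycuda's API:

```python
import multiprocessing as mp

# Hypothetical stand-in for the CUDA driver state; in the real gist this
# would be pycuda's cuda.init() / pycuda.autoinit.
_driver_initialized = False

def init_driver():
    global _driver_initialized
    _driver_initialized = True

def _child_report(q):
    # Under fork, the child sees the parent's flag: the same inherited
    # state that makes a real cuInit fail in the child process.
    q.put(_driver_initialized)

def child_inherits_state():
    ctx = mp.get_context("fork")  # fork is a Unix-only start method
    q = ctx.Queue()
    p = ctx.Process(target=_child_report, args=(q,))
    p.start()
    inherited = q.get()
    p.join()
    return inherited

if __name__ == "__main__":
    init_driver()
    print(child_inherits_state())  # True on Linux: state leaked into the child
```

With a real driver in place of the flag, that leaked state is what surfaces as `cuInit failed: initialization error`.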
It initializes the driver, then spawns a child process

According to Nvidia, you may not fork() after initializing CUDA.
Can you give a pointer to their mention of this?
I'm not too surprised that "initialize CUDA, fork, initialize CUDA" is unsupported.
I'm hoping it's at least OK to initialize CUDA, fork+exec a clean process, and then initialize CUDA again there.
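A fork+exec'd child starts with a fresh address space, so nothing from the parent's driver state is inherited and the child can initialize CUDA from scratch. A minimal stdlib sketch of that pattern (the pycuda import mentioned in the comment is the assumed per-process init; adapt it to your setup):

```python
import subprocess
import sys

def run_in_clean_process(code):
    """Run `code` in a freshly exec'd interpreter (fork+exec), so no CUDA
    driver state is inherited from the current process."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    # In the real use case the child would `import pycuda.autoinit` and do
    # its own GPU work; a plain print stands in for that here.
    print(run_in_clean_process("print('fresh child, safe to cuInit')"))
```

This is also what Python's multiprocessing "spawn" start method gives you: each worker is a new interpreter rather than a fork of the CUDA-initialized parent.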
As another data point, we have observed that driver versions up to 361.42 seem to allow this, but it fails with newer versions.
This thread seems to document the behavior:
https://devtalk.nvidia.com/default/topic/973477/cuda-programming-and-performance/-cuda8-0-bug-child-process-forked-after-cuinit-get-cuda_error_not_initialized-on-cuinit-/
It initializes the driver, then spawns a child process

According to Nvidia, you may not fork() after initializing CUDA.
You can look at this issue: https://devtalk.nvidia.com/default/topic/1052699/tensorrt/how-to-implement-tensorrt-as-an-inference-server-/post/5343685/#reply pycuda doesn't work in a child process; it can't auto-init the CUDA device. So strange.
This error is due to the multiprocessing functionality in PyTorch's DataLoader class. If you can find another solution, I would urge you to use it, but a shortcut to get rid of the error is as follows: set the "num_workers" argument of the DataLoader to zero. This resolves the error temporarily, at the cost of increased computation time per epoch.
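The effect of that setting can be illustrated without PyTorch: zero workers means all loading happens in the parent process (which has already initialized CUDA), while any positive worker count forks child processes, which is where the cuInit error comes from. The `load_batches` helper below is a simplified stand-in for that worker handling, not PyTorch's actual DataLoader implementation:

```python
import multiprocessing as mp

def _identity(x):
    # Trivial per-item "loading" work for the sketch.
    return x

def load_batches(dataset, num_workers=0):
    # Simplified stand-in for DataLoader's worker handling: with
    # num_workers=0 everything runs in the parent process, so no forked
    # child ever touches the already-initialized CUDA driver.
    if num_workers == 0:
        return [_identity(item) for item in dataset]
    # num_workers > 0 forks worker processes (the default start method on
    # Linux), reproducing the fork-after-cuInit situation from this thread.
    with mp.Pool(num_workers) as pool:
        return pool.map(_identity, dataset)
```

With `num_workers=0` the trade-off is exactly as described above: the error goes away, but loading is serialized in the main process.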