Deadlocks when accessing Context with active GL interop

Open s-ol opened this issue 7 years ago • 10 comments

I'm having trouble porting my array-based code to an interop-based renderer.

I'm instantiating the array using an allocator, like this:

		def gl_buffer_allocator(size):
			ubo = glGenBuffers(1)
			glBindBuffer(GL_UNIFORM_BUFFER, ubo)
			glBufferStorage(GL_UNIFORM_BUFFER, size, None, GL_MAP_READ_BIT | GL_MAP_WRITE_BIT)
			glBindBuffer(GL_UNIFORM_BUFFER, 0)
			return GLBuffer(ctx, mem_flags.READ_WRITE, int(ubo))

to_device cannot work because it doesn't acquire the GLBuffer, and I cannot do that beforehand since the buffer isn't allocated yet. It works like this:

			self.grid = Array(queue, self.grid_array.shape, self.grid_array.dtype, allocator=allocator)
			self.grid.queue = None # didn't want to associate a queue yet
			self.grid.allocator = None # make sure `.get()` doesn't allocate GLBuffers

For some reason, passing a Context instead of a CommandQueue makes this lock up here. Freeing the context seems like the wrong thing to do...?

#0  0x00007fffeb5749aa in ?? () from /usr/lib/libnvidia-glcore.so.396.24
#1  0x00007fffeb1b2190 in ?? () from /usr/lib/libnvidia-glcore.so.396.24
#2  0x00007fffeb4c9f32 in ?? () from /usr/lib/libnvidia-glcore.so.396.24
#3  0x00007fffec70b6a3 in glcuR0d4nX () from /usr/lib/libGLX_nvidia.so.0
#4  0x00007fffe8853544 in ?? () from /usr/lib/libnvidia-opencl.so.1
#5  0x00007fffe8752881 in ?? () from /usr/lib/libnvidia-opencl.so.1
#6  0x00007fffe8751595 in ?? () from /usr/lib/libnvidia-opencl.so.1
#7  0x00007ffff032eb94 in clReleaseContext () from /usr/lib/libOpenCL.so.1
#8  0x00007ffff0572d95 in context::~context() () from /usr/lib/python3.6/site-packages/pyopencl/_cffi.abi3.so
#9  0x00007ffff057303a in context::~context() () from /usr/lib/python3.6/site-packages/pyopencl/_cffi.abi3.so
#10 0x00007ffff055f9fd in ?? () from /usr/lib/python3.6/site-packages/pyopencl/_cffi.abi3.so

The same deadlock is preventing me from using grid.with_queue(), grid.setitem() etc. I realized later that I can trigger it just by accessing the context attribute of a CommandQueue:

	def step(self):
		with CommandQueue(self.ctx) as queue:
			cl.enqueue_acquire_gl_objects(queue, [self.grid.base_data])
			# uncomment to lock
			# queue.context
			self.grid.set(self.grid_array, queue=queue)
			cl.enqueue_release_gl_objects(queue, [self.grid.base_data])
		print('got here')

Interestingly, it runs until 'got here', but I never see the result of the set call. The step() method also never returns for me. If I debug the script in pudb, the interface closes as I step out of the method.

I'll see if I can create a small reproducible example now.

s-ol • May 26 '18 15:05

Here we go:

from OpenGL.GL import *
from OpenGL.GLUT import *
import pyopencl as cl
import pyopencl.array
import numpy as np

def get_ctx():
	from pyopencl.tools import get_gl_sharing_context_properties
	import sys

	platform = cl.get_platforms()[0]

	if sys.platform == "darwin":
		return cl.Context(properties=get_gl_sharing_context_properties(),
				devices=[])
	else:
		# Some OSs prefer clCreateContextFromType, some prefer
		# clCreateContext. Try both.
		try:
			return cl.Context(properties=[
				(cl.context_properties.PLATFORM, platform)]
				+ get_gl_sharing_context_properties())
		except:
			return cl.Context(properties=[
				(cl.context_properties.PLATFORM, platform)]
				+ get_gl_sharing_context_properties(),
				devices = [platform.get_devices()[0]])

def gl_allocator(size):
	ubo = glGenBuffers(1)
	glBindBuffer(GL_UNIFORM_BUFFER, ubo)
	glBufferStorage(GL_UNIFORM_BUFFER, size, None, GL_MAP_READ_BIT | GL_MAP_WRITE_BIT)
	glBindBuffer(GL_UNIFORM_BUFFER, 0)
	return cl.GLBuffer(ctx, cl.mem_flags.READ_WRITE, int(ubo))

glutInit()
glutInitWindowSize(512, 512)
glutCreateWindow('gpWFC')
glutDisplayFunc(lambda: 0)

ctx = get_ctx()
data = np.arange(100)
with cl.CommandQueue(ctx) as queue:
	arr = cl.array.Array(queue, data.shape, data.dtype, allocator=gl_allocator)

def key(*args):
	print("key pressed")
	with cl.CommandQueue(ctx) as queue:
		cl.enqueue_acquire_gl_objects(queue, [arr.base_data])
		queue.context
		arr.set(data, queue=queue)
		cl.enqueue_release_gl_objects(queue, [arr.base_data])
glutKeyboardFunc(key)
glutMainLoop()

Leave this open, press any key once, and it should lock up. My system info is in this comment.

s-ol • May 26 '18 15:05

> It works like this:

I'd discourage in-place modification of Array instances. Instead, simply pass your buffer to the constructor via the data= kwarg.
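A minimal sketch of what passing the buffer via data= could look like, assuming the GL buffer object already exists and is large enough for the requested shape and dtype. The helper name `array_from_gl_buffer` and its argument order are hypothetical, and this has not been run against the driver setup from this issue:

```python
def array_from_gl_buffer(ctx, queue, gl_buffer_id, shape, dtype):
    """Wrap an existing OpenGL buffer object in a pyopencl Array.

    Hypothetical helper for illustration; the GL buffer must already be
    allocated with at least prod(shape) * itemsize bytes.
    """
    # Imported lazily so that defining the helper needs neither a GPU
    # nor an installed pyopencl.
    import pyopencl as cl
    import pyopencl.array

    gl_buf = cl.GLBuffer(ctx, cl.mem_flags.READ_WRITE, int(gl_buffer_id))
    # data= hands the buffer to the Array directly, so no allocator is
    # involved and no attributes need to be patched afterwards.
    return cl.array.Array(queue, shape, dtype, data=gl_buf)
```

The buffer still has to be acquired with cl.enqueue_acquire_gl_objects before OpenCL reads or writes it, and released afterwards, exactly as in the reproduction above.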

> Freeing the context seems like the wrong thing to do...?

That's weird. OpenCL is reference counted, so all clReleaseContext should do is decrease the refcount--unless that was indeed the last reference to the context.
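To make the reference-counting argument concrete, here is a toy model in plain Python. It is only an illustration: `RefCounted`, `retain` and `release` are made-up names standing in for the roles of clRetainContext and clReleaseContext, and nothing here touches a real OpenCL runtime.

```python
class RefCounted:
    """Toy stand-in for an OpenCL handle such as a cl_context (not a real API)."""

    def __init__(self, name):
        self.name = name
        self.refcount = 1       # creating the object hands out the first reference
        self.destroyed = False

    def retain(self):
        # Plays the role of clRetainContext: one more owner.
        self.refcount += 1

    def release(self):
        # Plays the role of clReleaseContext: one owner fewer.
        if self.destroyed:
            raise RuntimeError(f"{self.name} was already destroyed")
        self.refcount -= 1
        if self.refcount == 0:
            self.destroyed = True   # only the last release tears the object down
        return self.destroyed


ctx = RefCounted("context")
ctx.retain()                          # e.g. a second wrapper for the same handle
destroyed_after_first = ctx.release()   # wrapper goes away: only the count drops
destroyed_after_second = ctx.release()  # last owner goes away: object is destroyed
```

Under this model, a release that is not the last one should be cheap and harmless, which is why a hang inside clReleaseContext is surprising.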

inducer • May 30 '18 18:05

@inducer I tried that, but it also triggered the hang. Maybe clReleaseContext is only decreasing the reference count and something else is going on; I only assumed what's in the title from the backtrace.

If you don't have time to look into this, could you recommend a debugging strategy?

EDIT: leaving this link here for reference, I'll check my dmesg output next time and also see if I can get a test setup on Windows.

s-ol • May 31 '18 10:05

@inducer have you had a chance to take a look at the minimal example I provided above?

s-ol • Oct 21 '18 06:10

I have not, sorry. But you may want to retry with git master, which is a whole different code base (actually, mostly a revival of the old Boost.Python code on top of pybind11).

inducer • Oct 21 '18 07:10

> I have not, sorry. But you may want to retry with git master, which is a whole different code base (actually, mostly a revival of the old Boost.Python code on top of pybind11).

Great, I've given it a shot but I am experiencing some issues with NVIDIA Optimus / Bumblebee on my laptop: Bumblebee-Project/Bumblebee#778

Xlib:  extension "NV-GLX" missing on display ":0"

Having dealt with these things in the past, though, I think the fix is just to wait until I get back to my desktop PC, where Optimus doesn't get in the way.

s-ol • Oct 29 '18 05:10

Unfortunately still experiencing the same problem:

(gdb) bt
#0  0x00007f98d5c76853 in  () at /usr/lib/libnvidia-glcore.so.415.25
#1  0x00007f98b0999478 in  ()
#2  0x00007ffd5f107d58 in  ()
#3  0x00007ffd5f107d58 in  ()
#4  0x000055e2a22f3870 in  ()
#5  0x00007f98d6cf0ebd in  () at /usr/lib/libGLX_nvidia.so.0
#6  0x00007f98d5c3901d in  () at /usr/lib/libnvidia-glcore.so.415.25
#7  0x00007f98d5bf0d02 in  () at /usr/lib/libnvidia-glcore.so.415.25
#8  0x00007f98d6cb8033 in glcuR0d4nX () at /usr/lib/libGLX_nvidia.so.0
#9  0x00007f98d2e1d794 in  () at /usr/lib/libnvidia-opencl.so.1
#10 0x00007f98d2d1b7d1 in  () at /usr/lib/libnvidia-opencl.so.1
#11 0x00007f98d2d1a4e5 in  () at /usr/lib/libnvidia-opencl.so.1
#12 0x00007f98d92deef4 in clReleaseContext () at /usr/lib/libOpenCL.so.1
#13 0x00007f98d8cdad8b in std::_Sp_counted_ptr<pyopencl::context*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
    at /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#14 0x00007f98d8cda6a4 in pybind11::class_<pyopencl::context, std::shared_ptr<pyopencl::context> >::dealloc(pybind11::detail::value_and_holder&) ()
    at /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#15 0x00007f98d8cce01f in pybind11_object_dealloc ()
    at /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#16 0x00007f98e4e5dd9e in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#17 0x00007f98e4d9eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#18 0x00007f98e4de5492 in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#19 0x00007f98e4e57c42 in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#20 0x00007f98e4d9eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#21 0x00007f98e4d9fdec in _PyFunction_FastCallDict () at /usr/lib/libpython3.7m.so.1.0
#22 0x00007f98e4e5943c in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#23 0x00007f98e4d9eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#24 0x00007f98e4de5492 in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0
#25 0x00007f98e4e58b7d in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#26 0x00007f98e4d9fc0b in _PyFunction_FastCallDict () at /usr/lib/libpython3.7m.so.1.0
#27 0x00007f98e4e5943c in _PyEval_EvalFrameDefault () at /usr/lib/libpython3.7m.so.1.0
#28 0x00007f98e4d9eb99 in _PyEval_EvalCodeWithName () at /usr/lib/libpython3.7m.so.1.0
#29 0x00007f98e4de5492 in _PyFunction_FastCallKeywords () at /usr/lib/libpython3.7m.so.1.0

This is my own code, but interestingly enough I now have the same problem running examples/gl_interop_demo.py:

(gdb) bt
#0  0x00007fffe8b11896 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#1  0x00007fffe8b3e5fc in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#2  0x00007fffe87657b0 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#3  0x00007fffe8a8bd02 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#4  0x00007fffe9b53033 in glcuR0d4nX () from /usr/lib/libGLX_nvidia.so.0
#5  0x00007fffe5cd8794 in ?? () from /usr/lib/libnvidia-opencl.so.1
#6  0x00007fffe5bd67d1 in ?? () from /usr/lib/libnvidia-opencl.so.1
#7  0x00007fffe5bd54e5 in ?? () from /usr/lib/libnvidia-opencl.so.1
#8  0x00007ffff7203ef4 in clReleaseContext () from /usr/lib/libOpenCL.so.1
#9  0x00007ffff5411d8b in std::_Sp_counted_ptr<pyopencl::context*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#10 0x00007ffff54116a4 in pybind11::class_<pyopencl::context, std::shared_ptr<pyopencl::context> >::dealloc(pybind11::detail::value_and_holder&) ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#11 0x00007ffff540501f in pybind11_object_dealloc ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#12 0x00007ffff7b664c0 in _PyFunction_FastCallKeywords () from /usr/lib/libpython3.7m.so.1.0
#13 0x00007ffff7bd8dfa in _PyEval_EvalFrameDefault () from /usr/lib/libpython3.7m.so.1.0
#14 0x00007ffff7b1fb99 in _PyEval_EvalCodeWithName () from /usr/lib/libpython3.7m.so.1.0
#15 0x00007ffff7b20ab4 in PyEval_EvalCodeEx () from /usr/lib/libpython3.7m.so.1.0
#16 0x00007ffff7b20adc in PyEval_EvalCode () from /usr/lib/libpython3.7m.so.1.0
#17 0x00007ffff7c4ac94 in ?? () from /usr/lib/libpython3.7m.so.1.0
#18 0x00007ffff7c4c8be in PyRun_FileExFlags () from /usr/lib/libpython3.7m.so.1.0
#19 0x00007ffff7c4dc75 in PyRun_SimpleFileExFlags () from /usr/lib/libpython3.7m.so.1.0
#20 0x00007ffff7c4feb7 in ?? () from /usr/lib/libpython3.7m.so.1.0
#21 0x00007ffff7c500fc in _Py_UnixMain () from /usr/lib/libpython3.7m.so.1.0
#22 0x00007ffff7dae223 in __libc_start_main () from /usr/lib/libc.so.6
#23 0x000055555555505e in _start ()

However examples/gl_particle_animation.py works fine...

s-ol • Jan 16 '19 12:01

What are the differences in the context setup code between examples/gl_interop_demo.py and examples/gl_particle_animation.py? What happens if you graft the context setup code from one onto the other?

inducer • Jan 17 '19 16:01

In examples/gl_particle_animation.py the context is created simply by

platform = cl.get_platforms()[0]
ctx = cl.Context(properties=[(cl.context_properties.PLATFORM, platform)] + get_gl_sharing_context_properties())  

while in examples/gl_interop_demo.py there is this slightly more elaborate block:

platform = cl.get_platforms()[0]

from pyopencl.tools import get_gl_sharing_context_properties
import sys
if sys.platform == "darwin":
    ctx = cl.Context(properties=get_gl_sharing_context_properties(),
            devices=[])
else:
    # Some OSs prefer clCreateContextFromType, some prefer
    # clCreateContext. Try both.
    try:
        ctx = cl.Context(properties=[
            (cl.context_properties.PLATFORM, platform)]
            + get_gl_sharing_context_properties())
    except:
        ctx = cl.Context(properties=[
            (cl.context_properties.PLATFORM, platform)]
            + get_gl_sharing_context_properties(),
            devices = [platform.get_devices()[0]])

Replacing the second with the first doesn't change the outcome, though:

(gdb) bt
#0  0x00007fffe8b3e5f2 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#1  0x00007fffe87657b0 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#2  0x00007fffe8a8bd02 in ?? () from /usr/lib/libnvidia-glcore.so.415.25
#3  0x00007fffe9b53033 in glcuR0d4nX () from /usr/lib/libGLX_nvidia.so.0
#4  0x00007fffe5cd8794 in ?? () from /usr/lib/libnvidia-opencl.so.1
#5  0x00007fffe5bd67d1 in ?? () from /usr/lib/libnvidia-opencl.so.1
#6  0x00007fffe5bd54e5 in ?? () from /usr/lib/libnvidia-opencl.so.1
#7  0x00007ffff7203ef4 in clReleaseContext () from /usr/lib/libOpenCL.so.1
#8  0x00007ffff5411d8b in std::_Sp_counted_ptr<pyopencl::context*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#9  0x00007ffff54116a4 in pybind11::class_<pyopencl::context, std::shared_ptr<pyopencl::context> >::dealloc(pybind11::detail::value_and_holder&) ()
   from /home/s-ol/Documents/other/gpWFC/venv/lib/python3.7/site-packages/pyopencl-2018.2.2-py3.7-linux-x86_64.egg/pyopencl/_cl.cpython-37m-x86_64-linux-gnu.so
#10 0x00007ffff540501f in pybind11_object_dealloc ()

Also, I finally managed to load the Python GDB utils, but they don't give any more information (I assume because my Python build is not compiled for debugging):

(gdb) thread apply all py-bt-full

Thread 11 (Thread 0x7fffe1554700 (LWP 31451)):
Unable to locate python frame

Thread 10 (Thread 0x7fffe1d55700 (LWP 31450)):
Unable to locate python frame

Thread 9 (Thread 0x7fffe2556700 (LWP 31449)):
Unable to locate python frame

Thread 8 (Thread 0x7fffe2d57700 (LWP 31448)):
Unable to locate python frame

Thread 7 (Thread 0x7fffe3558700 (LWP 31447)):
Unable to locate python frame

Thread 6 (Thread 0x7fffe3f61700 (LWP 31446)):
#0 Waiting for the GIL

Thread 5 (Thread 0x7fffe4762700 (LWP 31445)):
Unable to locate python frame

Thread 4 (Thread 0x7fffedb81700 (LWP 31436)):
Unable to locate python frame

Thread 3 (Thread 0x7ffff2382700 (LWP 31435)):
Unable to locate python frame

Thread 2 (Thread 0x7ffff2b83700 (LWP 31434)):
Unable to locate python frame

Thread 1 (Thread 0x7ffff7883600 (LWP 31416)):
#12 (unable to read python frame information)

s-ol • Jan 19 '19 12:01

The fact that the backtrace contains clReleaseContext suggests that the Nvidia runtime has a bug that makes it dislike having the context refcount decreased (perhaps specifically while GL interop is still active). One thing to try: hold on to a handle to the context somewhere, so that it doesn't get released prematurely.

inducer • Jan 20 '19 19:01