scalene
scalene copied to clipboard
Scalene hangs because it incorrectly grabs the GIL during arbitrary malloc calls.
Describe the bug
Scalene uses the Python C-API with the GIL held during arbitrary interposed malloc
calls. This is incorrect and causes valid programs to hang.
To Reproduce
Consider the following extension module:
#include <Python.h>
#include <stdio.h>
static void *
thread_start(void *arg)
{
void* p = malloc(12345);
free(p);
return NULL;
}
static PyObject *do_stuff(PyObject *self, PyObject *args) {
pthread_t threadId;
int s = pthread_create(&threadId, NULL, &thread_start, NULL);
pthread_join(threadId, NULL);
Py_RETURN_NONE;
}
static PyMethodDef EmbMethods[] = {
{"do_stuff", do_stuff, METH_NOARGS},
{NULL, NULL, 0, NULL}};
static PyModuleDef FooModule = {PyModuleDef_HEAD_INIT,
"foo",
NULL,
-1,
EmbMethods,
NULL,
NULL,
NULL,
NULL};
PyMODINIT_FUNC PyInit_foo() {
return PyModule_Create(&FooModule);
}
Now run scalene over the following module:
import foo
foo.do_stuff()
ensuring that the malloc
call will be traced (assign the sampling interval or force scalene
to sample it by modifying the source to handle this specific allocation). If you want a quick and dirty way to ensure this allocation is traced, just modify scalene with the following:
diff --git a/src/include/sampleheap.hpp b/src/include/sampleheap.hpp
index f8dd338..8c11ede 100644
--- a/src/include/sampleheap.hpp
+++ b/src/include/sampleheap.hpp
@@ -165,7 +165,7 @@ class SampleHeap : public SuperHeap {
} else {
_cCount += realSize;
}
- if (unlikely(sampleMalloc)) {
+ if (realSize == 12360 || unlikely(sampleMalloc)) {
process_malloc(sampleMalloc, ptr);
}
}
The program hangs because:
- Thread 1 has the GIL.
- Thread 1 creates thread 2 with the GIL held and joins it.
- Thread 2 calls malloc without the GIL.
- The malloc call into thread 2 goes into scalene, it samples the malloc call, calls into the C-API but for that it needs the GIL. It requests the GIL but will never receive it because thread 1 has it.
- We have a deadlock: thread 1 waits for thread 2 to finish and thread 2 waits for thread 1 to drop the GIL.
Note that not grabbing the GIL and calling into the C-API is not an option because the program can crash if calling into the C-API without the GIL held.
Expected behavior
scalene
doesn't hang
In short, requesting the GIL or calling into GIL-requiring Python C-API calls like PyGILState_GetThisThreadState
within a malloc
call is incorrect because another thread may be joining the thread that calls malloc
with the GIL held, and scalene
is incorrectly assuming that the GIL will be available here.
Hi Pablo, thank you for raising the issue, it's always appreciated.
This is not to say scalene shouldn't handle this better, but I'm wondering why would someone call pthread_join
while holding the GIL, rather than drop it as per the Python documentation?
This happens a lot if you are calling in any Python extension that goes into native code and this code spawns threads.
Extensions may need to do these calls with the GIL held to ensure synchronization with other resources if these threads are using Python objects.
For example, consider a library that spawns some kind of monitor thread and starts doing calculations with Python objects or buffer protocol objects and syncs with the monitor thread from time to time. The Gil cannot be dropped here as python objects are being manipulated and the moment the monitor thread (or whatever other thread that has been spawned that is not dealing with Python stuff) calls malloc
or similar, scalene
will deadlock. It would be possible to just drop in under join()
but that will force a lot of unnecessary context switching making the extension much slower (and the extension has all the right to not drop the GIL here because the other thread is not calling into Python).
Another case is a C++ thread pool that is spawned from a GIL owning thread. The destructor of said pool will likely want to join the threads but the moment free()
is called from the destructor or from another of the workers, scalene
will deadlock.
This is quite a common pattern in C++ bindings, specially the ones that run with callbacks and is even more common in bindings where the underlying library has been developed without knowing that will be wrapped and the binder author is in no control over how threads are spawned and joined.
while holding the GIL, rather than drop it as per the Python documentation?
Notice we indicate that in the CPython docs as a way to run without the GIL held but there is nothing forcing extensions not to do that. In fact, in CPython we have some places where we need to run native code (sqlite
extension IIRC) with the GIL held for correctness.
The more abstract problem description is that extension modules can make decisions on when and how to drop the GIL based on reasoning around what thread needs or doesn't need the GIL. scalene
violates these assumptions as is effectively making any thread that allocates request the GIL.