scalene Scalene hangs because it incorrectly grabs the GIL during arbitrary malloc calls.

Describe the bug Scalene uses the Python C-API with the GIL held during arbitrary interposed malloc calls. This is incorrect and causes valid programs to hang.

To Reproduce

Consider the following extension module:

#include <Python.h>
#include <stdio.h>

static void *
thread_start(void *arg)
{
    void* p = malloc(12345);
    free(p);
    return NULL;
}

static PyObject *do_stuff(PyObject *self, PyObject *args) {
  pthread_t threadId;
  int s = pthread_create(&threadId, NULL, &thread_start, NULL);
  pthread_join(threadId, NULL);
  Py_RETURN_NONE;
}

static PyMethodDef EmbMethods[] = {
    {"do_stuff", do_stuff, METH_NOARGS},
    {NULL, NULL, 0, NULL}};

static PyModuleDef FooModule = {PyModuleDef_HEAD_INIT,
                                  "foo",
                                  NULL,
                                  -1,
                                  EmbMethods,
                                  NULL,
                                  NULL,
                                  NULL,
                                  NULL};
PyMODINIT_FUNC PyInit_foo() {
  return PyModule_Create(&FooModule);
}

Now run scalene over the following module:

import foo
foo.do_stuff()

ensuring that the malloc call will be traced (assign the sampling interval or force scalene to sample it by modifying the source to handle this specific allocation). If you want a quick and dirty way to ensure this allocation is traced, just modify scalene with the following:

diff --git a/src/include/sampleheap.hpp b/src/include/sampleheap.hpp
index f8dd338..8c11ede 100644
--- a/src/include/sampleheap.hpp
+++ b/src/include/sampleheap.hpp
@@ -165,7 +165,7 @@ class SampleHeap : public SuperHeap {
     } else {
       _cCount += realSize;
     }
-    if (unlikely(sampleMalloc)) {
+    if (realSize == 12360 || unlikely(sampleMalloc)) {
       process_malloc(sampleMalloc, ptr);
     }
   }

The program hangs because:

Thread 1 has the GIL.
Thread 1 creates thread 2 with the GIL held and joins it.
Thread 2 calls malloc without the GIL.
The malloc call into thread 2 goes into scalene, it samples the malloc call, calls into the C-API but for that it needs the GIL. It requests the GIL but will never receive it because thread 1 has it.
We have a deadlock: thread 1 waits for thread 2 to finish and thread 2 waits for thread 1 to drop the GIL.

Note that not grabbing the GIL and calling into the C-API is not an option because the program can crash if calling into the C-API without the GIL held.

Expected behavior scalene doesn't hang

May 11 '22 19:05 pablogsal

In short, requesting the GIL or calling into GIL-requiring Python C-API calls like PyGILState_GetThisThreadState within a malloc call is incorrect because another thread may be joining the thread that calls malloc with the GIL held, and scalene is incorrectly assuming that the GIL will be available here.

May 11 '22 19:05 pablogsal

Hi Pablo, thank you for raising the issue, it's always appreciated.

This is not to say scalene shouldn't handle this better, but I'm wondering why would someone call pthread_join while holding the GIL, rather than drop it as per the Python documentation?

May 13 '22 13:05 jaltmayerpizzorno

This happens a lot if you are calling in any Python extension that goes into native code and this code spawns threads.

Extensions may need to do these calls with the GIL held to ensure synchronization with other resources if these threads are using Python objects.

For example, consider a library that spawns some kind of monitor thread and starts doing calculations with Python objects or buffer protocol objects and syncs with the monitor thread from time to time. The Gil cannot be dropped here as python objects are being manipulated and the moment the monitor thread (or whatever other thread that has been spawned that is not dealing with Python stuff) calls malloc or similar, scalene will deadlock. It would be possible to just drop in under join() but that will force a lot of unnecessary context switching making the extension much slower (and the extension has all the right to not drop the GIL here because the other thread is not calling into Python).

Another case is a C++ thread pool that is spawned from a GIL owning thread. The destructor of said pool will likely want to join the threads but the moment free() is called from the destructor or from another of the workers, scalene will deadlock.

This is quite a common pattern in C++ bindings, specially the ones that run with callbacks and is even more common in bindings where the underlying library has been developed without knowing that will be wrapped and the binder author is in no control over how threads are spawned and joined.

while holding the GIL, rather than drop it as per the Python documentation?

Notice we indicate that in the CPython docs as a way to run without the GIL held but there is nothing forcing extensions not to do that. In fact, in CPython we have some places where we need to run native code (sqlite extension IIRC) with the GIL held for correctness.

The more abstract problem description is that extension modules can make decisions on when and how to drop the GIL based on reasoning around what thread needs or doesn't need the GIL. scalene violates these assumptions as is effectively making any thread that allocates request the GIL.

May 13 '22 16:05 pablogsal

scalene scalene copied to clipboard

Scalene hangs because it incorrectly grabs the GIL during arbitrary malloc calls.

scalene
scalene copied to clipboard