memray icon indicating copy to clipboard operation
memray copied to clipboard

Incorrect hybrid stacks on Python 3.11

Open godlygeek opened this issue 3 years ago • 6 comments

Because _PyEval_EvalFrameDefault can allocate memory in Python 3.11 before it calls our profile function, we can fail to correctly pair up the Python stack with the native stack.

Moderately horrific reproducer:

import functools
import memray


def foo():
    list(map(bar, [None]))


def bar(_):
    x = 0
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    x += 1
    print(x)


with memray.Tracker(
    destination=memray.FileDestination("output.bin", overwrite=True),
    native_traces=True,
):
    functools.partial(foo)()

with memray.FileReader("output.bin") as f:
    for i, allocation in enumerate(f.get_high_watermark_allocation_records(), 1):
        print(f"\nAllocation {i}:")
        for frame in allocation.hybrid_stack_trace():
            print(frame[0])

The printed hybrid stack is:

_PyMem_RawMalloc
PyMem_RawMalloc
_PyObject_Malloc
_PyObject_Malloc
PyMem_Malloc
_PyCode_CreateLineArray
_PyCode_InitLineArray
call_trace
call_trace_protected
trace_function_entry
foo
_PyEval_EvalFrame
_PyEval_Vector
_PyFunction_Vectorcall
_PyObject_VectorcallTstate
map_next
list_extend
list___init___impl
list_vectorcall
_PyObject_VectorcallTstate
PyObject_Vectorcall
trace_call_function
<module>

which shows map_next calling into foo, rather than the other way around.

godlygeek avatar Sep 13 '22 00:09 godlygeek

A bit precarious, but we can use something akin to this to fix it:

diff --git a/src/memray/_memray.pyx b/src/memray/_memray.pyx
index b739153..f87c5a5 100644
--- a/src/memray/_memray.pyx
+++ b/src/memray/_memray.pyx
@@ -214,11 +214,18 @@ cdef class AllocationRecord:

         cdef size_t python_frame_index = 0
         cdef size_t num_python_frames = len(python_stack)
+        cdef bool skip_eval_frame = False
         for native_frame in native_stack:
             if python_frame_index >= num_python_frames:
                 break
             symbol = native_frame[0]
+            if symbol == "call_trace_protected":
+                skip_eval_frame = True
             if self._is_eval_frame(symbol):
+                if skip_eval_frame:
+                    yield native_frame
+                    skip_eval_frame = False
+                    continue
                 while python_frame_index < num_python_frames:
                     python_frame = python_stack[python_frame_index]
                     python_frame_index += 1

This produces the following:

_PyObject_Malloc
_PyCode_CreateLineArray
call_trace
call_trace_protected
_PyEval_EvalFrameDefault
_PyEval_Vector
map_next
list_extend
list_vectorcall
PyObject_Vectorcall
foo
_PyEval_Vector
partial_vectorcall
PyObject_Vectorcall
<module>

I checked and call_trace_protected can be inlined if forced by adding __attribute("always_inline")__ but adding inline will never inline _call_trace_protected. This doesn't make this method less flaky but in all interpreters I checked the function is always not inlined.

The propper fix will likely be using the custom evaluation function but this can be a fallback.

pablogsal avatar Sep 13 '22 11:09 pablogsal

Doesn't work when we don't have symbols - https://github.com/bloomberg/memray/pull/209 works for me locally but not in CI.

godlygeek avatar Sep 15 '22 00:09 godlygeek

Then there is no way to detect this problem in those cases. This means that we need to either won't fix it skip the test in the absence of symbols.

pablogsal avatar Sep 15 '22 09:09 pablogsal

This is another possibility:

diff --git a/src/memray/_memray/tracking_api.cpp b/src/memray/_memray/tracking_api.cpp
index 72a1059..6777798 100644
--- a/src/memray/_memray/tracking_api.cpp
+++ b/src/memray/_memray/tracking_api.cpp
@@ -469,6 +469,13 @@ PythonStackTracker::recordAllStacks()
     s_tracker_generation++;
 }

+PyObject*
+PyEvalFrame_Memray(PyThreadState* ts, _PyInterpreterFrame* f, int val) {
+    RecursionGuard::isActive = true;
+    return _PyEval_EvalFrameDefault(ts, f, val);
+}
+
+
 void
 PythonStackTracker::installProfileHooks()
 {
@@ -483,6 +490,9 @@ PythonStackTracker::installProfileHooks()
     // they can't start profiling before we capture their stack and miss it.
     compat::setprofileAllThreads(nullptr, nullptr);

+    PyThreadState* tstate = PyThreadState_GET();
+    _PyInterpreterState_SetEvalFrameFunc(tstate->interp, PyEvalFrame_Memray);
+
     // Find and record the Python stack for all existing threads.
     recordAllStacks();

@@ -495,6 +505,8 @@ PythonStackTracker::removeProfileHooks()
 {
     assert(PyGILState_Check());
     compat::setprofileAllThreads(nullptr, nullptr);
+    PyThreadState* tstate = PyThreadState_GET();
+    _PyInterpreterState_SetEvalFrameFunc(tstate->interp, NULL);
     std::unique_lock<std::mutex> lock(s_mutex);
     s_initial_stack_by_thread.clear();
 }
@@ -742,6 +754,11 @@ Tracker::trackAllocationImpl(void* ptr, size_t size, hooks::Allocator func)
     }
     RecursionGuard guard;

+    PyThreadState* tstate = PyGILState_GetThisThreadState();
+    if (tstate->tracing > 0) {
+        return;
+    }
+
     PythonStackTracker::get().emitPendingPushesAndPops();

     if (d_unwind_native_frames) {
@@ -1062,6 +1079,7 @@ PyTraceFunction(
         int what,
         [[maybe_unused]] PyObject* arg)
 {
+    RecursionGuard::isActive = false;
     RecursionGuard guard;
     if (!Tracker::isActive()) {
         return 0;
@@ -1125,6 +1143,8 @@ install_trace_function()
     PyEval_SetProfile(PyTraceFunction, profileobj);
     Py_DECREF(profileobj);

+    _PyInterpreterState_SetEvalFrameFunc(ts->interp, PyEvalFrame_Memray);
+
     PyFrameObject* frame = PyEval_GetFrame();

     // Push all of our Python frames, most recent last.  If we reached here

but at this point as we are going to lose optimizations better to regain them by actually using the profile function.

pablogsal avatar Sep 15 '22 12:09 pablogsal

I came up with this same idea last night, and was going to suggest it today! 😄

It does mean sacrificing some optimizations, though - but at least we'd gain correctness...

godlygeek avatar Sep 15 '22 15:09 godlygeek

It does mean sacrificing some optimizations, though

Yeah, I honestly prefer to sacrifice correctness here unless we leverage the eval function in its entirety. If we are going to depend on it, better to depend on it fully and leverage all possibilities.

I do not think is ok to sacrifice a lot of speed for all users for something that's likely not going to appear in 99% of cases.

pablogsal avatar Sep 15 '22 15:09 pablogsal

The reason why we are abandoning the idea of the custom eval function:

https://github.com/bloomberg/memray/pull/213#issuecomment-1259815614

pablogsal avatar Sep 27 '22 21:09 pablogsal