memray Fetch line numbers for Python frames lazily

Up until now we've been greedily capturing the line number of each Python frame as we first encounter it, but this is unnecessarily inefficient. Since we've already been assuming that it's safe to call PyFrame_GetLineNumber without the GIL held (and we'd already need to change this code should that assumption become false in a future CPython version), we can delay calculating the Python line number until we're about to emit the frame. Since many frames are entered without ever being emitted (because no allocation is performed under them), this drastically reduces the number of calls to PyFrame_GetLineNumber, which is a deceptively slow function.

Apr 21 '22 09:04 godlygeek

Flipping this to Draft - this is subtly broken in a way that our test suite doesn't catch. I'll need to circle back to this - and add a new test...

Apr 21 '22 20:04 godlygeek

OK, I think this is now actually correct... And the new test case covers the problem that the original version of this had, so that we know we won't regress on it.

May 11 '22 06:05 godlygeek

Experiments executed for python -m memray run -o /dev/null -m test test_list test_list test_list test_list test_list test_list test_list test_list test_list test_list for a total duration of 1.2595737380001992 seconds in baseline.

With this PR

      Calls to PyFrame_GetLineNumber: 25550
     0.03%     0.03%  python  python3.10  [.] PyFrame_GetLineNumber                                                                                ▒

Baseline

      Calls to PyFrame_GetLineNumber: 270090
     0.39%     0.39%  python  python3.10                               [.] PyFrame_GetLineNumber                                                   ◆

This means that in general the most we can save (that we are indeed saving it) is 0.39% of the total runtime (this is assuming no PYTHONMALLOC=malloc where most savings will be there since calls to PyFrame_GetLineNumber in pushPythonFrame are a bigger percentage of the runtime because our hooks are called less).

For the case where PYTHONMALLOC=malloc for a total duration of 3.0231499819997225 seconds in baseline:

PR

      Calls to PyFrame_GetLineNumber: 21057170
+    5.26%     5.26%  python  python3.10                               [.] PyFrame_GetLineNumber

Baseline

      Calls to PyFrame_GetLineNumber: 21125440
+    7.23%     7.23%  python  python3.10                               [.] PyFrame_GetLineNumber

This means that in the latter we are saving ~2% maximum. This is consistent with the timing results of 2.8712007330000233 (PR) against 2.935535135000009 (baseline).

PR

  Test duration: 2.8712007330000233
  Time (mean ± σ):      3.058 s ±  0.051 s    [User: 3.008 s, System: 0.056 s]
  Range (min … max):    2.983 s …  3.167 s    10000 runs

Baseline

    Test duration: 2.935535135000009
  Time (mean ± σ):      3.120 s ±  0.046 s    [User: 3.070 s, System: 0.056 s]
  Range (min … max):    3.064 s …  3.210 s    10000 runs

That is around 3.5% of the difference.

Aug 17 '22 12:08 pablogsal