Fetch line numbers for Python frames lazily
Up until now we've been greedily capturing the line number of each
Python frame as we first encounter it, but this is unnecessarily
inefficient. Since we've already been assuming that it's safe to call
PyFrame_GetLineNumber without the GIL held (and we'd already need to
change this code should that assumption become false in a future CPython
version), we can delay calculating the Python line number until we're
about to emit the frame. Since many frames are entered without ever
being emitted (because no allocation is performed under them), this
drastically reduces the number of calls to PyFrame_GetLineNumber,
which is a deceptively slow function.
Flipping this to Draft - this is subtly broken in a way that our test suite doesn't catch. I'll need to circle back to this - and add a new test...
OK, I think this is now actually correct... And the new test case covers the problem that the original version of this had, so that we know we won't regress on it.
Experiments executed for python -m memray run -o /dev/null -m test test_list test_list test_list test_list test_list test_list test_list test_list test_list test_list for a total duration of 1.2595737380001992 seconds in baseline.
With this PR
Calls to PyFrame_GetLineNumber: 25550
0.03% 0.03% python python3.10 [.] PyFrame_GetLineNumber ▒
Baseline
Calls to PyFrame_GetLineNumber: 270090
0.39% 0.39% python python3.10 [.] PyFrame_GetLineNumber ◆
This means that in general the most we can save (that we are indeed saving it) is 0.39% of the total runtime (this is assuming no PYTHONMALLOC=malloc where most savings will be there since calls to PyFrame_GetLineNumber in pushPythonFrame are a bigger percentage of the runtime because our hooks are called less).
For the case where PYTHONMALLOC=malloc for a total duration of 3.0231499819997225 seconds in baseline:
PR
Calls to PyFrame_GetLineNumber: 21057170
+ 5.26% 5.26% python python3.10 [.] PyFrame_GetLineNumber
Baseline
Calls to PyFrame_GetLineNumber: 21125440
+ 7.23% 7.23% python python3.10 [.] PyFrame_GetLineNumber
This means that in the latter we are saving ~2% maximum. This is consistent with the timing results of 2.8712007330000233 (PR) against 2.935535135000009 (baseline).
PR
Test duration: 2.8712007330000233
Time (mean ± σ): 3.058 s ± 0.051 s [User: 3.008 s, System: 0.056 s]
Range (min … max): 2.983 s … 3.167 s 10000 runs
Baseline
Test duration: 2.935535135000009
Time (mean ± σ): 3.120 s ± 0.046 s [User: 3.070 s, System: 0.056 s]
Range (min … max): 3.064 s … 3.210 s 10000 runs
That is around 3.5% of the difference.