Duplicate entries in `memray table --leaks` reporter.
Current Behavior
I am testing Memray's ability to identify memory leaks in Python programs that use C extensions. I have a program that uses a library with a known memory leak in its extension code, and I am running Memray in several configurations to find the one that shows the leak most prominently.
When I run my program under memray with --native mode for 1 hour, the profile file is about 2 GB without aggregated mode, or 6 MB in aggregated mode (yay for this feature!).
When I create a table report, it has about 25k rows, yet many of the rows are duplicates. I converted the HTML report to a CSV file and then did further aggregation to remove duplicates by adding together allocations that happened in the same place.
This reduced the number of rows to ~200; when I removed the Thread_ID column and aggregated the entries again, I ended up with only 50 rows and had meaningful information about my leak.
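For reference, the post-processing was roughly the following (a minimal sketch using pandas; the file name is hypothetical and the column names are those of the exported CSV shown below):

```python
# Minimal sketch of the post-processing described above (pandas-based); the
# file name is hypothetical and the columns match the CSV exported from the table report.
import pandas as pd

df = pd.read_csv("memray-table.csv")

# First pass: merge rows that share (Thread_ID, Allocator, Location), summing
# Size and Allocations. This took the table from ~25k rows down to ~200.
per_thread = (
    df.groupby(["Thread_ID", "Allocator", "Location"], as_index=False)[
        ["Size", "Allocations"]
    ].sum()
)

# Second pass: drop Thread_ID and aggregate again, leaving ~50 rows.
per_location = (
    per_thread.groupby(["Allocator", "Location"], as_index=False)[
        ["Size", "Allocations"]
    ].sum()
)

print(per_location.sort_values("Size", ascending=False).to_string(index=False))
```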
Here are sample entries from the table that demonstrate the duplication:
Thread_ID,Size,Allocator,Allocations,Location
0x1,192,malloc,1,operator new(unsigned long) at <unknown>:0
0x1,56,malloc,1,operator new(unsigned long) at <unknown>:0
...
0x1,944,malloc,1,_PyMem_RawMalloc at Objects/obmalloc.c:99
0x1,944,malloc,1,_PyMem_RawMalloc at Objects/obmalloc.c:99
...
0x25 (fn_api_status_handler),328,realloc,1,upb_Arena_InitSlow at <unknown>:0
0x38 (Thread-30 (_run)),328,realloc,1,upb_Arena_InitSlow at <unknown>:0
0x1c (Thread-6 (_run)),328,realloc,1,upb_Arena_InitSlow at <unknown>:0
...
0x25 (fn_api_status_handler),71,malloc,1,<unknown> at <unknown>:0
0x25 (fn_api_status_handler),87,malloc,1,<unknown> at <unknown>:0
...
In my case the application creates several ephemeral threads, so the thread ID info is not meaningful; it would be nice to have an option to exclude ThreadID from the report. But even that aside, it seems that allocations happening at the same location should be added together, increasing the Size and total Allocations of a single row instead.
I have also tried using --trace-python-allocators, which didn't have a meaningful impact.
I also verified that duplicate entries appear in the table report when not using --native mode. It happens, but at a smaller ratio: post-processing resulted in a 3x reduction in the number of rows.
Expected Behavior
In the table report, there should be only one row for each (Thread_ID, Size, Allocator, Location) tuple.
Bonus: provide the option to exclude ThreadID from the report.
Steps To Reproduce
My setup is somewhat involved: I am profiling a data processing pipeline, also discussed in https://github.com/bloomberg/memray/discussions/852. I am happy to attach the collected profiles if that helps. If a repro is required for a meaningful investigation, let me know.
Memray Version
1.19.1
Python Version
3.10
Operating System
Linux
Anything else?
No response
This behavior is reproducible with Memray's fibonacci exercise (https://github.com/bloomberg/memray/blob/main/docs/tutorials/exercise_1/fibonacci.py).
The table report contains duplicates, for example:
0x1 277.3 kB realloc 1 fibonacci at fibonacci.py:18
0x1 246.4 kB realloc 1 fibonacci at fibonacci.py:18
0x1 277.3 kB realloc 1 fibonacci at fibonacci.py:18
0x1 53.1 MB malloc 28104 fibonacci at fibonacci.py:18
0x1 57.0 MB malloc 29338 fibonacci at fibonacci.py:18
0x1 42.2 MB malloc 24770 fibonacci at fibonacci.py:18
Actually, I spoke too soon: that report was created without --leaks. Still, I am not sure we'd want to see duplicates there either.
Repro for duplicates in the table --leaks report:
cd docs/tutorials/exercise_3
memray run --native lru_cache.py
python -m memray table --leaks --force memray-lru_cache.py.3617871.bin
Output table has duplicate entries:
0x1 2.0 kB calloc 1 arena_map_get at Objects/obmalloc.c:1435
0x1 2.0 kB calloc 1 arena_map_get at Objects/obmalloc.c:1435
0x1 576 B malloc 1 _PyObject_Malloc at Objects/obmalloc.c:1966
0x1 528 B malloc 1 _PyObject_Malloc at Objects/obmalloc.c:1966
I think the table winds up with one entry per thread per distinct call stack leading to an allocation, even though we only show the final stack frame that called the allocator in the table.
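To illustrate, a contrived example (not taken from your program) where two different call stacks reach the same allocating line, which would produce two table rows that look identical:

```python
# leaky.py - contrived illustration: the same line performs the allocation,
# but it is reached through two different call stacks, so the report tracks
# two distinct stacks even though the displayed location is the same.
def alloc():
    return bytearray(64 * 1024)  # the only line that allocates


def from_path_a():
    return alloc()


def from_path_b():
    return alloc()


# Keeping references means both allocations are still live at exit, so under
# --leaks they show up as two separate rows, both pointing at alloc().
leaked = [from_path_a(), from_path_b()]
```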
That sounds like a plausible explanation, and it might explain how the table grew to 20k+ entries, given the use of thread pools in my application.
From a Memray product standpoint, which direction do you think we should go to fix this: document it and declare it working as intended (WAI), or reduce the duplication when creating the report?
I don't have a very strong opinion here... from a product PoV, I think that the table reporter is nearly useless, although I will say that it's indeed more useful with --leaks than without: normally there are hundreds of pages of output from which you can glean nothing useful, while with --leaks the output is likely to be much smaller. A listing of every one of the thousands or millions of allocations that your program made over the course of its run is quite uninteresting, though.
I'm fine with deduplicating by thread id + allocator + caller; that does seem like a reasonable change. I think I'd also be fine with changing the defaults to match the flamegraph reporter, where allocations from different threads are aggregated together by default but there's a --split-threads command line argument to keep them separate. That's arguably a backwards-incompatible change, but I doubt this reporter has many users, and any users it does have would likely be happy with the change... What do you think, @pablogsal ?
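Something along these lines, purely as a sketch of the aggregation (not the actual reporter code; the record fields are illustrative):

```python
# Sketch of deduplicating table rows by (thread, allocator, caller); the field
# names here are illustrative and do not reflect Memray's internal data model.
from collections import defaultdict


def deduplicate(rows, split_threads=True):
    """Merge rows that share the same key, summing size and allocation count."""
    totals = defaultdict(lambda: [0, 0])  # key -> [total_size, total_allocations]
    for row in rows:
        key = (
            row["thread_id"] if split_threads else None,
            row["allocator"],
            row["location"],
        )
        totals[key][0] += row["size"]
        totals[key][1] += row["n_allocations"]
    return [
        {
            "thread_id": thread,
            "allocator": allocator,
            "location": location,
            "size": size,
            "n_allocations": count,
        }
        for (thread, allocator, location), (size, count) in totals.items()
    ]
```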
I don't have a very strong opinion here... from a product PoV, I think that the table reporter is nearly useless.
I have found value in the table reporter in the past, for one of the leaks I investigated. Sorting the lines by the Size column allowed me to quickly find the line of code responsible for the largest allocation, whereas the flamegraph in my scenario was quite large and many spans were hard to read since they didn't all fit on the screen.
I think I'd also be fine with changing the defaults to match the flamegraph reporter, where allocations from different threads are aggregated together by default, but there's a --split-threads command line argument to keep them separate. That's arguably a backwards incompatible change, but I doubt this reporter has many users, and any users it does have would likely be happy with the change...
+1 for this change. If you'd like to keep it backwards compatible, you could add a --join-threads option to the table reporter, but then there would be some divergence between --join-threads for the table reporter and --split-threads for the flamegraph reporter.
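For concreteness, a hypothetical sketch of the backwards-compatible flag (the --join-threads name is illustrative only and is not an existing Memray option; the flamegraph reporter instead aggregates threads by default and offers --split-threads to opt out, which is where the divergence would come from):

```python
# Hypothetical sketch of the backwards-compatible option discussed above; the
# --join-threads flag name is illustrative only, not an existing Memray flag.
import argparse

parser = argparse.ArgumentParser(prog="memray table")
parser.add_argument(
    "--join-threads",
    action="store_true",
    help="Aggregate allocations from all threads into a single row per location",
)

args = parser.parse_args(["--join-threads"])
print(args.join_threads)  # True
```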