Allocation record processing fails while trying to calculate the high watermark of a 1.5+ GB bin output file

Open bqback opened this issue 4 months ago • 9 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

When running memray flamegraph on a large file, I get Memray ERROR: Failed to process allocation record while it is calculating the high watermark. This is the same error as #43, but the output file is orders of magnitude smaller than the file that caused that error.

A perfectly browsable HTML file is still produced despite the error during calculation/processing.

Only some of the produced outputs cause this error; ~~slightly smaller files (900 MB to 1.1-1.3 GB) seem to be processed fine~~ there seems to be no correlation with file size (see the section at the end).
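
For anyone trying to narrow this down outside the reporter, here is a minimal sketch that replays the same high-watermark pass through memray's documented Python API (the file name is a placeholder):

```python
# Minimal sketch: replay the high-watermark pass outside `memray flamegraph`.
# Assumes memray's documented FileReader API; "output.bin" is a placeholder path.
from memray import FileReader

reader = FileReader("output.bin")
try:
    records = list(reader.get_high_watermark_allocation_records())
    print(f"processed {len(records)} high-watermark allocation records")
except Exception as exc:
    print(f"failed while computing the high watermark: {exc!r}")
```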

Expected Behavior

No errors produced by the memray flamegraph command.

Steps To Reproduce

Run memray flamegraph --force --temporal or memray flamegraph --force --leaks on a large file

Memray Version

1.15.0 on the server that collects the info

1.17.2 on a local setup that processes the output

Python Version

3.8 on the server

3.13 on the local setup

Operating System

Linux

Anything else?

Output of the parse | awk | sort | uniq pipeline used on one of the files:

396110692 ALLOCATION
       78 CONTEXT_SWITCH
    44210 FRAME_ID
 12919326 FRAME_POP
 24431599 FRAME_PUSH
        1 HEADER
   511384 MEMORY_RECORD
        1 THREAD
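
The same counts can also be gathered without awk. The sketch below pipes memray parse through Python and assumes the record kind is the first whitespace-separated token on each line; adjust the token index if the layout differs:

```python
# Sketch: count record kinds from `memray parse` output, analogous to the
# parse | awk | sort | uniq pipeline above. The record kind is assumed to be
# the first whitespace-separated token of each line.
import subprocess
from collections import Counter

proc = subprocess.Popen(
    ["memray", "parse", "output.bin"],  # placeholder capture file
    stdout=subprocess.PIPE,
    text=True,
)
counts = Counter(line.split()[0] for line in proc.stdout if line.strip())
proc.wait()
for kind, count in sorted(counts.items()):
    print(f"{count:>10} {kind}")
```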

The data is collected using memray --follow-forks --trace-python-allocators --force from a gunicorn app with 2*cores+1 workers.

Sizes of processed files

No error: 1.1 GB, 1.2 GB, 1.3 GB (2x), 1.5 GB, 1.6 GB, 1.8 GB, 1.9 GB (2x)

Error: 906.6 MB, 1.1 GB, 1.5 GB, 1.6 GB, 1.8 GB (2x), 1.9 GB, 2.1 GB

bqback avatar Jul 30 '25 08:07 bqback

Thanks for the detailed bug report! I tried to reproduce this issue with some basic tests using large allocation files, but I wasn’t able to hit the same error condition.

To help us debug this further, could you provide one of the following:

  1. The problematic .bin file (if possible to share) - even if it’s large, we could work with it to identify the specific allocation pattern causing the issue
  2. A minimal reproducer - a script that generates allocation patterns similar to what’s causing your issue, so we can create a test case that reliably triggers this error

The fact that it’s happening consistently around the 1.5 GB mark but not with smaller files (900 MB-1.3 GB) suggests there might be a specific threshold or edge case we’re hitting. Having a concrete reproduction case would really help us track down the root cause.

If the file contains sensitive data, even a sanitized version of the script that reproduces the issue would be helpful.

pablogsal avatar Jul 30 '25 09:07 pablogsal

I think 1 would be possible if the bin file can be semi-anonymized in some way; as for 2, it's a backend server for a full-scale webapp, so I don't think a reproducer can be made within a reasonable amount of time.

bqback avatar Jul 30 '25 11:07 bqback

I've just done what I should've done in the first place -- actually wrote down the sizes of every file I'm trying to process and whether the error occurred during processing.

It seems like my initial assumption that this is happening due to file size was thrown off by the first few outliers -- I now see that files as big as 1.9 GB were processed successfully, some files under 1.5 GB and even the 900 MB file mentioned in the post actually failed to process! So this is not a size issue, and I've now updated my original post to reflect this.

Is there a stack depth limit that could lead to this?

bqback avatar Jul 30 '25 11:07 bqback

My hypothesis at this time is that something is corrupting the files. This could be us somehow racing to write (unlikely but possible) or something in your file system.

We are going to need some way to discover what is writing incorrect information and what that incorrect information is, which is why we would need either a reproducer or the files themselves.

Could you try to see if you get the errors running memray over some example script that produces a big file? Maybe allocating something in a loop with --trace-python-allocators. This way we can try to distinguish whether it is something special in your application or your file system, and hopefully we can use that as a reproducer.
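
For concreteness, a sketch of the kind of loop script being suggested; run it under something like memray run --trace-python-allocators reproducer.py (the script name, iteration count, and sizes are placeholders):

```python
# Sketch of a possible reproducer: allocate in a loop so the capture file grows
# large enough to resemble the problematic ones. All numbers are arbitrary placeholders.
import random

def churn(iterations: int = 5_000_000) -> None:
    keep = []
    for _ in range(iterations):
        # Mix of small and medium allocations to generate many allocation records.
        keep.append(bytearray(random.randint(16, 4096)))
        if len(keep) > 10_000:
            keep.clear()

if __name__ == "__main__":
    churn()
```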

pablogsal avatar Jul 30 '25 12:07 pablogsal

The other option I can think of is that you compile memray, in the place where you read the file, with some special debug changes to better understand what is going on at read time. That would tell us in what way the file is corrupted, but not how it happened, so I'm not sure we would be able to fix it with that alone.

pablogsal avatar Jul 30 '25 12:07 pablogsal

Are output files good to parse at any point in time, or would the server have to be stopped for all related processes to finish writing? I just pulled them as is.

Another working theory is that space on the server ran out while memray was gathering data.

bqback avatar Jul 31 '25 07:07 bqback

> Are output files good to parse at any point in time, or would the server have to be stopped for all related processes to finish writing? I just pulled them as is.
>
> Another working theory is that space on the server ran out while memray was gathering data.

Unless you use aggregated files, output files are designed to be valid at any point in time, precisely so you can debug after a crash or an OOM situation.

pablogsal avatar Jul 31 '25 09:07 pablogsal

They're designed to be usable at any point in time, but that's different from being valid at any point in time... If you copied the file while Memray was in the middle of recording an allocation, that could absolutely explain the Memray ERROR: Failed to process allocation record warning that you saw. The failure to process the allocation record could absolutely have happened because only half of that allocation record's contents were in the file as of the time when you made the copy.
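
If stopping the server first isn't practical, one workaround is to copy the capture file only once its size has stopped changing. This is a sketch, not a guarantee that the tracked process has finished; paths and the settle interval are placeholders:

```python
# Sketch: copy a capture file only after its size has been stable for a settle
# window, to reduce the chance of snapshotting it mid-record. A stable size
# does not prove that tracking has stopped.
import shutil
import time
from pathlib import Path

def copy_when_quiescent(src: str, dst: str, settle_seconds: float = 5.0) -> None:
    src_path = Path(src)
    last_size = -1
    while True:
        size = src_path.stat().st_size
        if size == last_size:
            break  # no growth observed during the settle window
        last_size = size
        time.sleep(settle_seconds)
    shutil.copy2(src_path, dst)

if __name__ == "__main__":
    # Placeholder file names for illustration only.
    copy_when_quiescent("memray-output.bin", "memray-output-copy.bin")
```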

godlygeek avatar Jul 31 '25 15:07 godlygeek