
Memory allocation tracing support

Open orlp opened this issue 2 months ago • 0 comments

I've just finished a prototype memory tracer for Rust which you can use simply by wrapping your global allocator:

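use std::alloc::System; // Or whatever allocator you prefer.
// TracingAlloc and Config come from the prototype tracing crate. TracingAlloc::new and
// Config::default must be const fns here, since a static needs a constant initializer.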

#[global_allocator]
static ALLOCATOR: TracingAlloc<System> = TracingAlloc::new(System, Config::default());

Then, by calling output_fxprof_report("test.fxprof") at the end of main, you get a profile that can be loaded into the Firefox Profiler. For example, here is a memory profile of a simple program that builds and uses a regular expression: https://share.firefox.dev/3MW3GnJ.
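To make the wrapping pattern concrete, here is a minimal sketch of what such a wrapper can look like. It is not the actual TracingAlloc, whose code isn't shown in this issue; CountingAlloc and take_counters are hypothetical names. It only keeps the four totals that a GC event would report and forwards the real work to the inner allocator.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical wrapper: forwards every call to `inner` and counts bytes and
/// operations, i.e. the totals a GC event would report.
pub struct CountingAlloc<A> {
    inner: A,
    alloc_bytes: AtomicU64,
    dealloc_bytes: AtomicU64,
    alloc_ops: AtomicU64,
    dealloc_ops: AtomicU64,
}

impl<A> CountingAlloc<A> {
    /// `const` so the wrapper can initialize a `#[global_allocator]` static.
    pub const fn new(inner: A) -> Self {
        CountingAlloc {
            inner,
            alloc_bytes: AtomicU64::new(0),
            dealloc_bytes: AtomicU64::new(0),
            alloc_ops: AtomicU64::new(0),
            dealloc_ops: AtomicU64::new(0),
        }
    }

    /// Returns (alloc_bytes, dealloc_bytes, alloc_ops, dealloc_ops) accumulated
    /// since the previous call and resets them ("since the last GC" semantics).
    pub fn take_counters(&self) -> (u64, u64, u64, u64) {
        (
            self.alloc_bytes.swap(0, Ordering::Relaxed),
            self.dealloc_bytes.swap(0, Ordering::Relaxed),
            self.alloc_ops.swap(0, Ordering::Relaxed),
            self.dealloc_ops.swap(0, Ordering::Relaxed),
        )
    }
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for CountingAlloc<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { self.inner.alloc(layout) };
        if !ptr.is_null() {
            // Only successful allocations are counted.
            self.alloc_bytes.fetch_add(layout.size() as u64, Ordering::Relaxed);
            self.alloc_ops.fetch_add(1, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.inner.dealloc(ptr, layout) };
        self.dealloc_bytes.fetch_add(layout.size() as u64, Ordering::Relaxed);
        self.dealloc_ops.fetch_add(1, Ordering::Relaxed);
    }
    // `realloc` and `alloc_zeroed` use the trait's default methods, which are
    // built on `alloc`/`dealloc` above, so they are counted as well.
}
```

It is installed the same way as the snippet above: `#[global_allocator] static ALLOCATOR: CountingAlloc<System> = CountingAlloc::new(System);`. A real tracer would probably also override realloc to forward to the inner allocator, since the default implementation always allocates, copies and frees, and it would record per-thread counters and backtraces, which this sketch omits.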


I think it would be tremendously useful to integrate this into samply, so that you get not only a memory profile but also a simultaneous CPU profile. I believe the easiest path to a minimum viable product (without coupling samply specifically to my tracing crate) would be the following scheme:

  1. When samply record starts a process, it passes two environment variables to that process: SAMPLY_MEMTRACE_DIR and SAMPLY_MEMTRACE_VERSION. SAMPLY_MEMTRACE_DIR names a temporary directory created by samply, in which samply expects files named <pid>.memtrace, written in the file format version given by SAMPLY_MEMTRACE_VERSION.

  2. The file format can evolve as we discover new needs or as the Firefox Profiler evolves, but I would propose the following as a starting point. The format is plain ASCII text. The first line is MEMTRACE <version>, e.g. MEMTRACE 1.0 for the first version. Every subsequent line is an event: a two-character event code, a space, and then that event's data:

    • IN <id> <value>, an interning event to help compress the stream, where <id> must be of the form [_a-zA-Z][_a-zA-Z0-9]*. After this, <id> can be used as a shorthand to refer to a value in any position except event codes. IDs may be re-used. For example, after IN i0 5442734589 the string i0 refers to 5442734589.
    • TS <ts>, sets the current time to the given amount of whole nanoseconds since the UNIX epoch. All <ts> values (with the exception of later TS commands) are relative to this timestamp, in whole nanoseconds. Initially the current time is 0.
    • GC <ts> <alloc_bytes> <dealloc_bytes> <alloc_ops> <dealloc_ops>, a global counter event reporting the number of bytes (de)allocated and the number of (de)allocation operations since the previous GC event. The intent is for this counter to cover all allocations, not just traced ones, but note that this event (like all events) is optional.
    • TC <ts> <tid> <alloc_bytes> <dealloc_bytes> <alloc_ops> <dealloc_ops>, the same as GC except only counting (de)allocations which occurred in thread <tid>.
    • AL <ts> <tid> <n_bytes> <addr> [<instr_ptr> ...], an allocation event on thread <tid> of size <n_bytes> with starting address <addr>. An optional backtrace of space-separated instruction pointers follows, starting at the call closest to the allocation.
    • DE <ts> <tid> <n_bytes> <addr> [<instr_ptr> ...], a deallocation event, with the same fields as AL.

    The advantage of this file format over something like JSON is that it is very simple to write and can be written incrementally, which reduces the overhead on the traced process. For the <tid> values, I think gettid is a good candidate on Linux; I'm not sure what the best choice would be on macOS or Windows.

  3. After the recorded program has exited, samply checks SAMPLY_MEMTRACE_DIR for trace files and, if it finds any, processes them and includes them in the report.
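On samply's side, step 3 amounts to scanning the directory and parsing each file. The sketch below follows the event definitions proposed above; find_memtrace_files, parse_memtrace and the Event type are hypothetical names, not samply APIs. A few interpretations are assumptions: each event's ts is added to the most recent TS value (0 initially), an IN value may itself be an interned ID (only the ID being defined is taken literally), and every non-code field is a decimal unsigned integer. Unknown event codes are rejected here, although a real reader might prefer to skip them so the format can grow.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// One parsed MEMTRACE event, with `ts` already converted to absolute nanoseconds.
#[derive(Debug, PartialEq)]
pub enum Event {
    /// GC (tid = None) or TC (tid = Some): alloc_bytes, dealloc_bytes, alloc_ops, dealloc_ops.
    Counter { ts: u64, tid: Option<u64>, counts: [u64; 4] },
    /// AL (is_alloc = true) or DE (is_alloc = false), plus an optional backtrace.
    Memory { ts: u64, tid: u64, is_alloc: bool, n_bytes: u64, addr: u64, stack: Vec<u64> },
}

/// Returns (pid, path) for every `<pid>.memtrace` file in `dir`, sorted by pid.
pub fn find_memtrace_files(dir: &Path) -> io::Result<Vec<(u32, PathBuf)>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) != Some("memtrace") {
            continue;
        }
        // The file stem must be a decimal pid; other names are ignored.
        let pid = path.file_stem().and_then(|s| s.to_str()).and_then(|s| s.parse::<u32>().ok());
        if let Some(pid) = pid {
            found.push((pid, path));
        }
    }
    found.sort();
    Ok(found)
}

/// Parses the contents of one MEMTRACE 1.0 file into events.
pub fn parse_memtrace(input: &str) -> Result<Vec<Event>, String> {
    let mut lines = input.lines();
    let header = lines.next().ok_or("empty trace")?;
    if header.trim() != "MEMTRACE 1.0" {
        return Err(format!("unsupported header {header:?}"));
    }
    let mut interned: HashMap<String, String> = HashMap::new();
    let mut base_ts = 0u64; // "Initially the current time is 0."
    let mut events = Vec::new();
    for line in lines {
        let mut tokens = line.split_ascii_whitespace();
        let Some(code) = tokens.next() else { continue }; // skip blank lines
        if code == "IN" {
            let id = tokens.next().ok_or("IN without id")?;
            let value = tokens.next().ok_or("IN without value")?;
            let value = interned.get(value).cloned().unwrap_or_else(|| value.to_string());
            interned.insert(id.to_string(), value); // a later IN may rebind `id`
            continue;
        }
        // Resolve interned IDs, then parse every field as an unsigned integer.
        let nums = tokens
            .map(|t| {
                let t = interned.get(t).map(String::as_str).unwrap_or(t);
                t.parse::<u64>().map_err(|e| format!("{code}: bad field {t:?}: {e}"))
            })
            .collect::<Result<Vec<u64>, String>>()?;
        let field = |i: usize| nums.get(i).copied().ok_or_else(|| format!("{code}: missing field {i}"));
        match code {
            "TS" => base_ts = field(0)?,
            "GC" => events.push(Event::Counter {
                ts: base_ts + field(0)?,
                tid: None,
                counts: [field(1)?, field(2)?, field(3)?, field(4)?],
            }),
            "TC" => events.push(Event::Counter {
                ts: base_ts + field(0)?,
                tid: Some(field(1)?),
                counts: [field(2)?, field(3)?, field(4)?, field(5)?],
            }),
            "AL" | "DE" => events.push(Event::Memory {
                ts: base_ts + field(0)?,
                tid: field(1)?,
                is_alloc: code == "AL",
                n_bytes: field(2)?,
                addr: field(3)?,
                stack: nums.get(4..).unwrap_or(&[]).to_vec(),
            }),
            other => return Err(format!("unknown event code {other:?}")),
        }
    }
    Ok(events)
}
```

Keeping the parser independent of the directory scan means the same function could also read a trace written by a future LD_PRELOAD wrapper or any other producer.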

If samply does the above, which I believe isn't too difficult, I can detect it in my allocator wrapper and write the appropriate trace files. As a more ambitious but very interesting future project, one could also create an LD_PRELOAD allocator wrapper that supports this scheme, which would let samply do memory profiling on arbitrary binaries.
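On the traced-process side, detecting samply and writing a trace could look roughly like the sketch below. It assumes that SAMPLY_MEMTRACE_VERSION holds the same string that appears in the header (e.g. 1.0) and only writes a trace when it matches exactly; memtrace_path_from, memtrace_path and MemtraceWriter are hypothetical names, not part of any existing crate. Only the TS, GC, AL and DE events are shown.

```rust
use std::env;
use std::io::{self, Write};
use std::path::PathBuf;

/// The only format version this sketch writes (assumed to match SAMPLY_MEMTRACE_VERSION).
const VERSION: &str = "1.0";

/// Returns `<dir>/<pid>.memtrace` if samply asked for a trace in a version we support.
fn memtrace_path_from(dir: Option<&str>, version: Option<&str>, pid: u32) -> Option<PathBuf> {
    if version? != VERSION {
        return None;
    }
    Some(PathBuf::from(dir?).join(format!("{pid}.memtrace")))
}

/// Reads the two variables samply would set for the current process.
fn memtrace_path() -> Option<PathBuf> {
    let dir = env::var("SAMPLY_MEMTRACE_DIR").ok();
    let version = env::var("SAMPLY_MEMTRACE_VERSION").ok();
    memtrace_path_from(dir.as_deref(), version.as_deref(), std::process::id())
}

/// Hypothetical incremental writer: each call appends exactly one line.
struct MemtraceWriter<W: Write> {
    out: W,
}

impl<W: Write> MemtraceWriter<W> {
    /// Writes the mandatory header line.
    fn new(mut out: W) -> io::Result<Self> {
        writeln!(out, "MEMTRACE {VERSION}")?;
        Ok(MemtraceWriter { out })
    }

    /// TS: base time in whole nanoseconds since the UNIX epoch.
    fn set_time(&mut self, unix_ns: u64) -> io::Result<()> {
        writeln!(self.out, "TS {unix_ns}")
    }

    /// GC: alloc_bytes, dealloc_bytes, alloc_ops, dealloc_ops since the previous GC event.
    fn global_counter(&mut self, ts: u64, [ab, db, ao, dops]: [u64; 4]) -> io::Result<()> {
        writeln!(self.out, "GC {ts} {ab} {db} {ao} {dops}")
    }

    /// AL (alloc = true) or DE (alloc = false), followed by the backtrace, innermost first.
    fn mem_event(&mut self, alloc: bool, ts: u64, tid: u64, n_bytes: u64, addr: u64, ips: &[u64]) -> io::Result<()> {
        let code = if alloc { "AL" } else { "DE" };
        write!(self.out, "{code} {ts} {tid} {n_bytes} {addr}")?;
        for ip in ips {
            write!(self.out, " {ip}")?;
        }
        writeln!(self.out)
    }

    /// Gives back the underlying sink.
    fn into_inner(self) -> W {
        self.out
    }
}
```

Writing into a Vec<u8> is convenient for checking the output, but inside a global allocator the sink must not allocate itself (or must guard against re-entering the allocator) and must be shared across threads, so a real tracer would likely use something like a BufWriter<File> created up front behind a lock.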


My questions to @mstange:

  1. Would you be interested in including memory allocation tracing support in samply?
  2. If yes, does the SAMPLY_MEMTRACE_DIR + SAMPLY_MEMTRACE_VERSION idea sound like a workable implementation?
  3. If yes, does the proposed MEMTRACE format seem reasonable?

I'd be willing to do some work on this with some mentorship on the finer details, particularly around passing instruction pointers (e.g. do they need to be offset against anything?).

orlp, Dec 28 '25 14:12