
frame pointers / perf-record --call-graph support

Open sorear opened this issue 5 years ago • 0 comments

On Linux, the easiest way for us to support runtime performance analysis is via the kernel's statistical profiler, perf(1). On the symbols branch, perf can already generate flat profiles, but nested (call graph) profiling requires the profiler to be able to unwind the stack. perf supports three ways to unwind the stack:

  • dwarf uses DWARF debug information in the binary. It would be nice to support this eventually for debugging mixed verified and unverified code with gdb, but that is a long way off. The DWARF walker is also restricted to examining memory within a small distance of the ABI stack pointer, so it cannot be used with the current CakeML code generation (#757). Since DWARF is loaded from executable files on disk, this option is also problematic for eval.
  • lbr uses the "last branch record" feature of Intel CPUs. This feature is not supported on AMD hardware, and it is unclear whether it is supported on non-x86 hardware from any vendor. It uses a hardware-managed stack of limited size (32 entries on Skylake) which is updated by CALL and RET instructions; since we use neither (#758), it does not currently produce a useful profile.
  • fp walks the stack using ABI-defined frame pointers. This has the same limitations regarding the stack as dwarf but does not require debug information. Instead, it requires the compiler to maintain a linked list of self-describing stack frames.

Once #758 is done, lbr should work transparently on Intel x64 hardware. To support non-Intel and non-x86 hardware, I propose adding optional frame pointer support to the compiler. In detail (for x64):

  • %rbp points to the currently active stack frame at all instruction boundaries (the stack is sampled at timer or PMU interrupts). %rbp therefore cannot be handed out by the register allocator when frame pointers are in use.
  • %rbp points to a two-word record: 0(%rbp) is the %rbp value for the previous stack frame, while 8(%rbp) is the current frame's return address.
  • Data for a frame goes below %rbp, that is, at lower addresses.
  • Immediately after a call and before allocating space for stack slots, a frame can be created by push %rbp; mov %rsp, %rbp. This uses the return address pushed by the hardware.
  • Immediately before returning, restore the old rbp value with pop %rbp.
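
The layout above can be illustrated with a rough C model of what an fp-style unwinder does with it. This is only a sketch under stated assumptions, not perf's actual implementation: the names `frame_record`, `walk_frames` and `example_walk` are hypothetical, the return addresses are made-up values, and the outermost frame is assumed to store a null saved %rbp to terminate the chain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two words found at the address held in %rbp. */
struct frame_record {
    struct frame_record *prev_fp; /* 0(%rbp): %rbp value of the previous frame */
    uintptr_t return_addr;        /* 8(%rbp): return address of the current frame */
};

/*
 * Follow the linked list of frame records starting at fp and store up to
 * max return addresses in out, innermost frame first. Returns the number
 * of addresses stored.
 */
size_t walk_frames(const struct frame_record *fp, uintptr_t *out, size_t max)
{
    size_t n = 0;
    while (fp != NULL && n < max) {
        out[n++] = fp->return_addr;
        /*
         * The stack grows downward, so an older frame must live at a higher
         * address. Stop if the chain does not move upward, to avoid looping
         * on a corrupted frame pointer.
         */
        if (fp->prev_fp != NULL && fp->prev_fp <= fp)
            break;
        fp = fp->prev_fp;
    }
    return n;
}

/*
 * Build a synthetic three-deep chain, as three nested calls would leave it
 * after each executed "push %rbp; mov %rsp, %rbp", and walk it.
 */
size_t example_walk(uintptr_t *out, size_t max)
{
    /* Higher index = higher address = older frame. */
    struct frame_record stack[3];

    stack[2].prev_fp = NULL;       /* outermost frame (assumed null terminator) */
    stack[2].return_addr = 0x1000; /* made-up address */
    stack[1].prev_fp = &stack[2];
    stack[1].return_addr = 0x2000;
    stack[0].prev_fp = &stack[1];  /* innermost frame: the sampled %rbp */
    stack[0].return_addr = 0x3000;

    assert(max > 0);
    return walk_frames(&stack[0], out, max);
}
```

The walker never needs to know how large each frame is or what it contains, which is why this scheme works without debug information: only the two words at %rbp have to be maintained by the compiler.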

The basic structure is the same for other architectures but the details (in particular the offsets of the save and return address from the frame pointer, and the register used) differ; try to match what gcc -fno-omit-frame-pointer does on that architecture.

sorear • Sep 17 '20 17:09