Heap snapshot API
This change introduces a new public API function, jerry_heap_snapshot_capture, permitting developers to inspect the current state of the JerryScript heap to diagnose memory-usage problems.
The heap is exposed as a directed graph, where nodes represent allocations (with their type, size, and textual representation) and edges represent references (with type and name). Once a snapshot is taken, the heap can be viewed directly as a graph, or processed further to produce a tree view of allocations ranked by size, etc.
The API itself is callback-based, with the developer providing function pointers to be called for every node and edge in the heap. They can then write these entries to a file, build their own in-memory model, filter for certain object types/sizes, etc. as their needs dictate.
By default, the entire feature is disabled and should have little to no memory or performance impact. It can be enabled in stages:
- `JERRY_FEATURE_HEAP_SNAPSHOT` - enables heap snapshots, but does not expose allocation sizes (all allocations reported as 0 bytes). Fixed memory footprint.
- `JERRY_FEATURE_HEAP_SNAPSHOT` + `JERRY_FEATURE_MEM_TRACK_ALLOCATION_SIZES` - also reports allocation sizes, at the cost of additional memory overhead equal to ~1/64th of the allocated heap size. Internal heap only.
- `JERRY_FEATURE_MEM_TRACK_ALLOCATION_SIZES` alone - if the developer already has heap pointers and just needs to know their allocation sizes.
Overview of changes
- Parameterize `ecma_gc_mark` (renamed `ecma_gc_traverse_inner`) to permit a callback to be called for every allocation that would normally be marked during a GC run. Similar changes were made to downstream functions like `ecma_gc_mark_property`.
- Add a new internal `ecma_gc_walk_heap` function that takes a callback and enumerates regular heap allocations, literals, bytecode objects, and "magic strings". While the "magic strings" are not actually heap-allocated, they are included in the enumeration since on-heap allocations may reference them.
- New public API `jerry_heap_snapshot_capture`, which is a wrapper around `ecma_gc_walk_heap` that resolves allocation types/sizes and generates pairs of node/edge callbacks for each allocation as appropriate.
- Explicit tracking of all heap allocation sizes via `jmem_heap_allocation_size`. Normally we rely on user code to know the sizes of the allocations it's working with; however, when all we have is a pointer to an unknown allocation, a standalone lookup function is required. This is implemented through an allocation boundary bitmap, nominally sized at ~1/64th of the heap it represents. It is only implemented for internal heaps, though it could be extended to all heap modes.
Open questions
- I've used some contrived inlining in `ecma-gc.c` to produce two copies of `ecma_gc_mark` (now called `ecma_gc_traverse_inner`): one with a parameterized "traverse" callback and the other with a pre-baked reference to `ecma_gc_set_object_visited`. This eliminates any overhead on regular `ecma_gc_mark` invocations, at the cost of inflating the binary. This may not be the correct trade-off for every user - should it be configurable?
- Furthermore, I did this only for `ecma_gc_traverse_inner`, while its descendant calls are left in their generic form, imposing a marginal overhead on regular GC runs. If this is a problem, I can use the same forced-inline-as-template-metaprogramming technique as with `ecma_gc_traverse_inner`, or just force-inline the functions as-is (most of them are called only once); either should eliminate any runtime impact. Again, the same trade-off of code space and source complexity vs. runtime performance applies.
What's not included
Serialization
I did not attempt to include any serialization/export functionality for heap snapshots, as there does not appear to be any standard format for doing so. V8 does implement one such format, but as far as I can tell it is undocumented and intended for internal use only.
Instead, I hope the example code provided in the API reference will act as a good starting point for anyone wishing to export the heap snapshot for further processing. An end-to-end example of this can be seen in the heap snapshot unit test, where a python script loads and inspects the heap from a simple flatfile.
Tooling
Partially due to the above point, I did not include any ready-made tools for analyzing heap snapshots once captured. I think this would be a good idea for future work, but I believe the API is useful even on its own... and this PR is large enough as-is!
JerryScript-DCO-1.0-Signed-off-by: Collin Fair [email protected]
@cpfair Thanks for the PR. Please, fix unit test compilation issues first. A red CI may scare away reviewers.
I would like to know a bit more about "diagnose memory-usage problems" first.
I've corrected the errors on CI.
Regarding "diagnose memory-usage problems": The most common use case for this feature would be "I want to reduce memory usage/I just ran out of heap; what changes will help the most?" That is, it's not meant for troubleshooting problems in JerryScript itself, only whatever additional C or JS the developer hooks up for their specific project.
This feature is particularly helpful for anyone who uses JerryScript to host third-party JavaScript inside their app/device. Here, the third-party developers have no visibility into the inner workings of the engine and are left to guess at which parts of their code are using the most memory. Previously, the best tools the "platform" could expose were the heap totals (used, peak, total). But these values are noisy due to GC, and still leave the third-party dev to manually bisect their JS until they find the problem. Meanwhile, a heap snapshot can point directly to the variables, functions, prototypes, etc. that are consuming the most memory, eliminating the tedium and guesswork.
My biggest worry is the GC part: a lot of unused arguments are added to functions, and these are not guarded conditionally. Basically we want to add extra functionality to the GC, which was never designed for that purpose. I would probably add a separate system for walking through the living objects, properly guarded by conditionals. Maintenance burden could be an interesting topic.
It was due to the issue of maintenance that I decided to generalize the GC logic rather than duplicate it - those functions are already quite long and tedious, so having two copies seemed like a bad plan.
In terms of unused arguments, I'm not exactly sure what your concern is. If it's in regards to performance, I think it would be better to rely on the compiler to optimize away unused arguments (and supporting code) via inlining, if only to reduce #ifdef clutter. Or, if it's an architectural concern about growing the scope of ecma_gc_*, then the core heap-walking logic could be moved into a separate module that can be used by both GC and heap snapshots. This could work either via a callback (slower) or by `#include`-ing two copies of the implementation with different parameters (faster, bulkier).
I wouldn't trust the compiler too much. Just recently it turned out that:
```c
while (a) {
  switch (b) {
    case X: goto exit;
    case Y: do_something;
    case Z: do_something;
  }
}
exit:
```
is faster than
```c
while (a) {
  if (b == X) {
    break;
  }
  switch (b) {
    case Y: do_something;
    case Z: do_something;
  }
}
```
with a fairly good gcc compiler.
I ran your patch on our internal measurement system (your API is disabled; I was just curious about the side effects of the patch):
| Benchmark | Perf (sec) |
|---|---|
| 3d-cube.js | 0.807 -> 0.811 : -0.504% |
| 3d-raytrace.js | 1.037 -> 1.039 : -0.273% |
| access-binary-trees.js | 0.560 -> 0.579 : -3.452% |
| access-fannkuch.js | 2.095 -> 2.101 : -0.300% |
| access-nbody.js | 1.087 -> 1.097 : -0.895% |
| bitops-3bit-bits-in-byte.js | 0.506 -> 0.508 : -0.301% |
| bitops-bits-in-byte.js | 0.665 -> 0.664 : +0.060% |
| bitops-bitwise-and.js | 0.926 -> 0.924 : +0.210% |
| bitops-nsieve-bits.js | 1.132 -> 1.130 : +0.135% |
| controlflow-recursive.js | 0.366 -> 0.366 : -0.111% |
| crypto-aes.js | 0.878 -> 0.882 : -0.463% |
| crypto-md5.js | 0.600 -> 0.598 : +0.177% |
| crypto-sha1.js | 0.595 -> 0.591 : +0.680% |
| date-format-tofte.js | 0.748 -> 0.748 : -0.006% |
| date-format-xparb.js | 0.526 -> 0.528 : -0.469% |
| math-cordic.js | 1.174 -> 1.172 : +0.211% |
| math-partial-sums.js | 0.737 -> 0.736 : +0.220% |
| math-spectral-norm.js | 0.541 -> 0.540 : +0.137% |
| string-base64.js | 1.407 -> 1.417 : -0.727% |
| string-fasta.js | 1.341 -> 1.348 : -0.502% |
| Geometric mean: | -0.305% |
Binary sizes (bytes) 22b08518c7:132664 5270d0119e:132664
Looks like the change has a measurable effect.
In the case of inlining functions to eliminate dead code and unused arguments, I have checked and confirmed the desired behaviour on GCC 7.3.1 when cross-compiling to ARM at any optimization level above -O0. That's not a guarantee it happens on every version of every compiler, of course.
In terms of the performance metrics you posted: I cannot find out how to run those locally, but I would hazard a guess that any slowdown is due to the tradeoff I mentioned in the original PR under "Open Questions." I have updated this PR to take the more code space/more performance option, which will presumably resolve the slowdown you saw. It does further complicate ecma-gc.c, however.
If either of these points are still an issue, I think the next step would be to switch to using the preprocessor to parameterize the heap-walking components and instantiate a copy for the GC (which would be identical to the pre-change behaviour) and, optionally, the heap snapshot code. It's ugly, so I'm hoping there is some other option I haven't thought of to avoid simply copy-pasting the code.
From a technical perspective, I would probably introduce a struct which contains the members required for the traversal, allocate it on the stack locally when the API function is called, and add a new global pointer (NULL when not used) to the jerry context which points to this structure.
But reading the api it seems we have a bigger issue:
Callback which is called at least once for every allocation in the JerryScript heap.
It seems the current code enumerates values accessed by the GC. But there are other values, e.g. strings created by an API function and not assigned to a JerryScript object; hence not all allocations are enumerated. I also feel that the amount of detail the API provides restricts future GC / memory allocator development. Also, new types will be added (symbol support is in progress), which would make the API unstable.
Any solution that adds indirection to the GC codepath will slow it down. In my measurements this was up to a 25% slowdown in regular GC runs (a result valid only for my MCU, of course). I don't think this is worth it, especially considering that there are options that introduce no runtime overhead.
The code already handles the case you mention, of strings allocated but not referenced: https://github.com/jerryscript-project/jerryscript/pull/2605/files#diff-955437c9f49410a80d622ae3dfc68e97R1281. Of course this is on a piecemeal basis, and as you say, it would require updating for future allocation types.
The API is meant as a debugging tool, so it does indeed expose the full topography of the heap. I can update the documentation to explicitly call out that its output may change between versions, but I don't think that's a deal-breaker given the use case.
A 25% slowdown is quite a lot. Could you share your code?
The 25% slowdown was measured before I implemented the inlining seen in this PR. One could replicate it by taking this PR and dropping the inline/JERRY_ATTR_ALWAYS_INLINE attributes I added in ecma-gc.c.
> From a technical perspective, I would probably introduce a struct which contains the members required for the traversal, allocate it on the stack locally when the API function is called, and add a new global pointer (NULL when not used) to the jerry context which points to this structure.
Could you try this as well? Should have 0 effect when the feature is disabled.
If I understand it correctly, that suggestion adds runtime overhead due to the additional branching to check for the struct pointer's NULL-ness during the traversal. Branch prediction would mitigate this somewhat, but that's entirely dependent on the processor being used.