trafficstars

Garbage collection

I would like to take a quick pass at nailing down garbage collection behavior more thoroughly:

[x] Understanding the Lua/LuaJIT GC design
[ ] Profiling the behavior in practice
[ ] Tuning the available parameters
[ ] Writing down some "do's and dont's" tips for Snabb hackers.

(Historically we have not dug deeply into this topic because using default settings and optimizing inner loops to avoid allocations has been basically okay. However, coming into the era of timeline logs I am keen to have more direct visibility of this important topic. I especially want to be able to account for any GC work separately from other interruptions like interrupts, context switches, page faults, system calls, etc.)

Background

Garbage collectors are all different. Each design has strengths and weaknesses. It's a valuable exercise to understand the GC so that you can enjoy the strengths while steering clear of the weaknesses.

Snabb is a peculiar application. We are very sensitive to GC latency, since we like to consistently perform our packet processing in ~100us, but we are not very GC sensitive to throughput, because our code tends to allocate very little garbage. Our Lua heap is also relatively small because we allocate most major data structures as FFI structures in C memory.

Thankfully our requirements seem well aligned with Lua. For example, people have used Lua for hard real-time robotics seem to have been able to limit the worst-case GC latency to some tens of microseconds. But the devil is in the details and so we should examine this ourselves.

Understanding Lua/LuaJIT GC design

LuaJIT seems to mostly inherit the Lua 5.1 garbage collector. On first glance it seems reasonable to characterize this collector as incremental, mark-and-sweep, non-generational, and tunable.

Incremental means that the garbage collector works in small steps. Each time you allocate some data you also spend some cycles on moving the garbage collector forwards through its garbage collection cycle state machine. (Contrast with a stop-the-world collector that runs the whole garbage collection cycles at once.)

Mark-and-sweep means that the GC first marks every object that is reachable and then sweeps to deallocate each dead object. So the overall cost of a garbage collection would seem to be proportional to the total heap size i.e. the number of both live and dead objects, since each one will be either marked or swept. (Contrast with a copying collector that only visits live objects.)

Non-generational means that on each cycle the GC will visit every object in the heap. So the overall cost of a GC is proportional to the total size of the heap. (Contrast with a generational collector that sometimes runs shorter cycles to quickly GC recently-allocated objects that are were short lived.)

Tunable means that the collectgarbage() API functions allows applications to control the GC in interesting ways. You can stop, restart, or force collection. You can tune pause and step multiplier values to control when to start a new cycle and how aggressively to push through it. You can also get a count of the current heap size to help make manual decisions.

Profiling the GC behavior in practice

TODO: Initial idea is to use the timeline log and the Lua GC API both to count the number of bytes heap-allocated in each step of the engine and also to defer all GC to the end of the breath where it can be timed in isolation from other code.

Tuning the available parameters

TODO: Initial idea is to use the step multiplier parameter to slow down the rate of GC cycles. For example to allow 100MB of garbage to accumulate before a new GC cycle is started. This may reduce the rate at which the GC needs to traverse and mark live objects, without impacting latency thanks to incremental operation, but at the expense of allowing garbage to consume more RAM.

Write down some "do's and dont's" for Snabb hackers

TODO: Summarize some recommendations for how to think about GC and also how to measure it.

Apr 06 '17 12:04 lukego

Here is an early screenshot from a heap-allocation profiler that I cooked up. (Click to zoom.)

rplot09

This is with the snabbmark basic1 benchmark. Data is from a timeline log that is augmented to stamp the size of the Lua heap on each event so that the precise amount of allocation at each step is visible. The graph suggests that each processing step usually allocates 0 bytes but that occasionally they do allocate, most strikingly the engine.polled_timers event that often shows ~50 bytes of allocation. Overall quite nice to see so little allocation.

Apr 07 '17 22:04 lukego

Early days but I am quite impressed with the Lua/LuaJIT GC. I have made an experimental patch to defer all GC until the end of a breath. This is mostly intended to make it more measurable: see all GC time in one place and don't let it get mixed in with the processing time of apps. (If we care about apps allocating memory then we can measure that separately, like on the diagram immediately above.)

I ran snabbmark basic1 to process 10 billion packets and took 31K samples of GC latency. I am being conservative and assuming a slow 2GHz clock when converting from cycles to microseconds. I think it is easiest to summarize the quantiles numerically:

     10%      25%      50%      90%      99%    99.9%   99.99%     100% 
 0.11300  0.12200  0.25200  1.55600  6.08188 10.87198 79.61105 81.73600

This says that the median (50%) GC latency was 0.25us and that in 99.9% of breaths the GC latency was <11us. The worst case in the sample is ~80us.

This looks fairly respectable to me. I would like to do a little more due diligence to make sure that these results are valid, and to compare them with a more substantial application like snabbnfv, and I would also like to understand the reason for the bad cases taking ~80us in case this can be improved (e.g. maybe it's due to an unnecessary memory-remapping system call, that would be nice...)

Can also be interesting to visually compare the total duration of app callbacks when GC is deferred until later (red) vs when it is allowed to happen anytime (blue). Click to zoom:

callbacks-gc

Here we see a bimodal pattern for the duration of each callback, but for Sink the deferred garbage collection is making the callback time more predictable and more consistently <10us. This seems like a nice property for "divide and conquer" because it will be easier to investigate the remaining outliers if we know that they are not related to GC.

(The durations are actually much more consistent than they look in this visualization... nearly all of the data points are down the bottom in that thick ink... if you really care then here is the ECDF for the same data.)

Apr 08 '17 13:04 lukego

snabb snabb copied to clipboard