Make object expiry scalable
TL;DR: This patch series solves scalability issues of the single-threaded object expiry by offloading object core de-referencing to worker threads.
The test scenario
To test the SLASH/ storage engines, I regularly run "burn in" tests on two reasonably big machines (500 GB RAM, 48 cores, terabytes of NVMe storage). The most important test constantly stresses the most problematic cases: objects are either very short-lived or very long-lived, have random sizes and random TTLs, there are basically only cache misses, and "way too many" passes, bans and purges. The test has evolved quite a bit and has been an invaluable tool to find all kinds of issues, mostly races, before users do.
The issue
The test exposed an issue where the number of objects goes through the roof as soon as LRU nuking starts, ultimately leading to an out-of-memory situation. (There have also been other relevant issues, most prominently a memory leak, but the issue addressed here was nevertheless real and root-caused in Varnish-Cache; see here for more details.)
The problem is illustrated by this graph of VSCs gathered during a test run. NB: the y axis has log scale!
Plotted here are various VSCs, either directly or as a per-second rate rps(x). FELLOW.fellow.g_mem_obj is a counter atomically updated by the storage engine, giving a reliable count of objects actually alive in the memory cache. The other counters are all standard Varnish-Cache counters.
We would expect MAIN.n_object to more or less match FELLOW.fellow.g_mem_obj, but as we can see, n_object goes from roughly 2^20 (~1M) to roughly 2^24 (~16M) with no indication of ever decreasing. It only stopped growing when the process died.
We can also see that the n_expired rate is pretty stable at first (due to the random but uniform nature of the test, we can assume that the number of objects expiring is in fact stable in relation to the number of objects added). But some time after nuking kicks in (black n_lru_nuked graph), not only does the number of objects go through the roof, the n_expired rate also almost comes to a complete stop (again, this is log scale; 2^5 is, as we all know, 32).
A suspicion
The immediate suspicion was that this issue could be related to the EXP thread being single threaded, but in retrospect, I would like to first give some ...
Background on the EXP thread
The EXP thread uses a binheap (a very interesting background read for those who do not know it) to keep a logical list of object core references ordered by expiry time. Consequently, all cacheable objects need to go through the EXP thread by design: when they enter the cache, EXP needs to register them on the binheap, where they stay until they either get removed externally or expire. So the expire thread is a natural bottleneck. While we could support multiple expire threads, it is not the actual maintenance of the binheap which is expensive, but above all object deletion as a consequence of expiry or removal.
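To make the binheap's role concrete, here is a minimal toy sketch (not Varnish code; all names such as toy_obj and heap_insert are made up for illustration) of the core idea: a binary min-heap keyed by expiry time, so the object due to expire next is always at the root, and EXP can sleep until that time.

```c
/* Toy min-heap keyed by expiry time; hypothetical names, not Varnish internals. */
#include <assert.h>
#include <stddef.h>

#define HEAP_MAX 64

struct toy_obj {
	double t_expire;	/* absolute expiry time */
};

static struct toy_obj *heap[HEAP_MAX];
static size_t heap_n = 0;

static void
heap_insert(struct toy_obj *o)
{
	size_t i, p;

	assert(heap_n < HEAP_MAX);
	i = heap_n++;
	heap[i] = o;
	/* sift up: a parent must expire no later than its children */
	while (i > 0) {
		p = (i - 1) / 2;
		if (heap[p]->t_expire <= heap[i]->t_expire)
			break;
		struct toy_obj *tmp = heap[p];
		heap[p] = heap[i];
		heap[i] = tmp;
		i = p;
	}
}

static struct toy_obj *
heap_pop_expired(double now)
{
	struct toy_obj *o;
	size_t i, c;

	if (heap_n == 0 || heap[0]->t_expire > now)
		return (NULL);	/* nothing due yet: EXP would sleep */
	o = heap[0];
	heap[0] = heap[--heap_n];
	/* sift down: restore heap order after removing the root */
	i = 0;
	while ((c = 2 * i + 1) < heap_n) {
		if (c + 1 < heap_n &&
		    heap[c + 1]->t_expire < heap[c]->t_expire)
			c++;
		if (heap[i]->t_expire <= heap[c]->t_expire)
			break;
		struct toy_obj *tmp = heap[i];
		heap[i] = heap[c];
		heap[c] = tmp;
		i = c;
	}
	return (o);
}
```

Insertion and removal are both O(log n), so the heap maintenance itself is cheap; the expensive part is what happens to each popped object afterwards.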
This PR
Herein I suggest a number of changes to make a single EXP thread scale up significantly:
Restructure the code to batch operations on the inbox and expired objects
Rather than handling one inbox entry / expired object at a time, we collect them in an array and lists, and then work through these outside the expire lock.
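The batching pattern can be sketched as follows (hypothetical names, not the actual patch): drain the whole inbox into a local array while holding the lock, then do the work with the lock released, so producers are blocked only for the brief copy.

```c
/* Sketch of lock-drain-then-process batching; names are illustrative only. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define BATCH_MAX 128

static pthread_mutex_t exp_mtx = PTHREAD_MUTEX_INITIALIZER;
static int inbox[BATCH_MAX];
static size_t inbox_n = 0;
static int processed;		/* demo counter for the trivial work callback */

static void
inbox_post(int oc)
{
	pthread_mutex_lock(&exp_mtx);
	assert(inbox_n < BATCH_MAX);
	inbox[inbox_n++] = oc;
	pthread_mutex_unlock(&exp_mtx);
}

/* trivial work callback for demonstration */
static void
count_work(int oc)
{
	(void)oc;
	processed++;
}

/* returns the number of entries processed in this batch */
static size_t
exp_batch(void (*work)(int))
{
	int batch[BATCH_MAX];
	size_t i, n;

	/* short critical section: just move the inbox contents */
	pthread_mutex_lock(&exp_mtx);
	n = inbox_n;
	for (i = 0; i < n; i++)
		batch[i] = inbox[i];
	inbox_n = 0;
	pthread_mutex_unlock(&exp_mtx);

	/* the (potentially expensive) work happens unlocked */
	for (i = 0; i < n; i++)
		work(batch[i]);
	return (n);
}
```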
Make sure that both removal and expiry are always handled
Before, removal had precedence over expiry, which could lead to the object pileup described above.
Offload de-referencing of object cores
We offload the actual de-referencing (which, most of the time, implies object deletion) to worker threads, which implicitly provides backpressure by using up system resources and worker threads.
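A minimal sketch of the offloading idea, using a bounded queue and a single consumer thread (hypothetical names; the real patch hands work to the Varnish worker pool, not a private thread): when the queue is full, the producer blocks, which is the implicit backpressure described above.

```c
/* Sketch of deref offloading via a bounded queue; illustrative names only. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QUEUE_MAX 4

static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;
static int queue[QUEUE_MAX];
static size_t q_head, q_len;
static int q_closed;
static int n_deref;		/* objects "dereferenced" by the worker */

static void
deref_offload(int oc)
{
	pthread_mutex_lock(&q_mtx);
	while (q_len == QUEUE_MAX)	/* backpressure: wait for room */
		pthread_cond_wait(&q_cond, &q_mtx);
	queue[(q_head + q_len++) % QUEUE_MAX] = oc;
	pthread_cond_broadcast(&q_cond);
	pthread_mutex_unlock(&q_mtx);
}

static void
queue_close(void)
{
	pthread_mutex_lock(&q_mtx);
	q_closed = 1;
	pthread_cond_broadcast(&q_cond);
	pthread_mutex_unlock(&q_mtx);
}

static void *
worker(void *priv)
{
	(void)priv;
	pthread_mutex_lock(&q_mtx);
	for (;;) {
		while (q_len == 0 && !q_closed)
			pthread_cond_wait(&q_cond, &q_mtx);
		if (q_len == 0)
			break;		/* closed and drained */
		(void)queue[q_head];	/* stand-in for the object core deref */
		q_head = (q_head + 1) % QUEUE_MAX;
		q_len--;
		n_deref++;
		pthread_cond_broadcast(&q_cond);
	}
	pthread_mutex_unlock(&q_mtx);
	return (NULL);
}
```

In the real system the consumers are ordinary worker threads, so a deref backlog competes for the same thread pool as request handling, throttling intake naturally.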
All details are documented in the individual commit messages.
Potential follow-up: hcb_cleaner has the same problem in principle, but I am not sure yet how relevant it is; I will continue testing.
My first thought, and this is not a replacement for this PR, is that it was a mistake to have a central EXP, and that these really should be STV methods.
I think all the places which call EXP() also know the stevedore, so that shouldn't be too much havoc, and it would give STVs the option of implementing EXP integrated with their storage layout policy.
I agree with the previous comment: multi-EXP would be helpful and would allow for stevedore-specific optimizations. But I see this as largely orthogonal to this PR, since we still need an EXP implementation for SML, which this PR improves, and other stevedore implementations will likely start by copying the default implementation anyway.