trafficserver Fast log buffer

This adds a configurable fast mode to log objects that use per-thread log buffers rather than having all threads contend for per-object buffers. While this is much faster, it does not support ordered log entries. This makes it situationally useful, but if a user has binary logs or doesn't need ordered log output, this could allow them to log many more transactions. Its configurable either globally via records.config or per-log via logging.yaml.

Sep 15 '22 16:09 cmcfarlen

What performance difference does your benchmark show? How about a cut-n-paste of the output?

Sep 20 '22 14:09 ywkaras

benchmark name                       samples       iterations    estimated
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
logobject fast                                 100             1     50.7197 s
                                        497.339 ms    482.904 ms    502.998 ms
                                        43.2047 ms    20.2149 ms    91.1906 ms

logobject slow                                 100             1     11.9278 m
                                         5.55437 s     5.43973 s     5.68358 s
                                        619.759 ms    508.693 ms    748.277 ms

Here is the benchmark output. The fast log buffers are about 10x faster. The performance improvement greatly depends on the performance of the CAS instruction, which on intel is abysmal under high contention. For AMD the perf difference is far less dramatic. As an anecdotal difference, we were able to run full logging (without sampling) without a noticeable drop in requests per second and fairly instantly filled a 10gig RAM disk with "fast" log files. Before this change, enabling logging would dramatically reduce our requests/second.

Also, I have to modify other parts of the code to run the benchmark since the JemallocNodumpAllocator creates (and leaks) thread local arenas, it cannot support as many thread starts as the benchmark spawns, but using the ink_freelist_* functions just adds another CAS bottleneck which ruins everything. I've included a commit that replaces all of that with a straight malloc Allocator class in case you want to try to reproduce this, but I was going to add that as an optional allocator in a separate PR. I'd like to remove that from this PR before any potential merging of this code (assuming it's acceptable).

Sep 20 '22 21:09 cmcfarlen

Has this been tested in production at all?

Sep 21 '22 18:09 ywkaras

This has only been tested in lab environments

Sep 21 '22 19:09 cmcfarlen

Looks good except for nitpicks.

Sep 21 '22 19:09 ywkaras

[approve ci]

Sep 23 '22 18:09 ywkaras

trafficserver trafficserver copied to clipboard

Fast log buffer

trafficserver
trafficserver copied to clipboard