Make trie structure less memory-hogging
Commit Message: Make trie structure less memory-hogging

Additional Description: The previous trie structure used approximately 8KB per unique character of prefix; in a production environment with about 27000 route prefixes, this amounted to approximately 1.5GB of additional RAM. This version of the trie reduces that memory impact by a factor of over 100, at the cost of a small increase in insert time and a very small increase in lookup time (still O(1)). The increase in operation cost is more than paid for by the smaller memory footprint (less memory-cache destruction) for tables larger than 2000 characters (e.g. 20 entries 100 characters long, 100 entries 20 characters long, etc.). The trie used for static headers was replaced in #33932 to avoid being subject to the performance impact of this change.

Risk Level: Small. Existing tests still pass, additional tests verify that some new edge cases also behave fine, and the performance impact is insignificant (a bit worse in cases where it was already very fast, better in cases where it was less fast).

Testing: Added extra unit tests.
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Benchmark below was generated with both implementations being tested side-by-side. PR now removes the old 'big' implementation as there is no longer any call for it.
New benchmark (in PR) results, running as `bazel run -c opt test/common/common:utility_trie_speed_test -- --benchmark_repetitions=10 --benchmark_out_format=csv --benchmark_out=somefile.csv`:
| Benchmark | BigMean | SmallMean | SmallMean / BigMean (>100% means BigTrieEntry was faster) |
|---|---|---|---|
| bmTrieLookupsRequestHeaders | 29.7503 | 45.5528 | 153% |
| bmTrieLookupsResponseHeaders | 21.1678 | 41.4313 | 195% |
| bmTrieLookups/10/Mixed | 280.286 | 303.578 | 108% |
| bmTrieLookups/100/Mixed | 419.074 | 351.339 | 83% |
| bmTrieLookups/1000/Mixed | 1282.69 | 576.878 | 44% |
| bmTrieLookups/10000/Mixed | 1376.65 | 768.84 | 55% |
| bmTrieLookups/10/8 | 4.71941 | 12.7108 | 269% |
| bmTrieLookups/100/8 | 9.93909 | 17.8588 | 179% |
| bmTrieLookups/1000/8 | 19.2406 | 29.3718 | 152% |
| bmTrieLookups/10000/8 | 37.3927 | 34.1951 | 91% |
| bmTrieLookups/10/128 | 511.97 | 1018.77 | 198% |
| bmTrieLookups/100/128 | 1734.64 | 749.413 | 43% |
| bmTrieLookups/1000/128 | 2927.7 | 1083.53 | 37% |
| bmTrieLookups/10000/128 | 4017.93 | 1319.94 | 32% |
Thanks for starting this! I know that previous attempts to update this data structure saw an increase in actual time (not in O notation), which seemed to be high for the header-map structure. IIRC there are some header-map benchmarks. Can you run them as well?
Also adding @jmarantz as he has good suggestions in this area. /assign @jmarantz @adisuissa
What do you think about adding a benchmark?
/wait
> What do you think about adding a benchmark?
I added the output of two existing benchmarks to the PR description (one of which didn't work before or after the change). The first one seems likely to be what you were looking for, and looks like it's ~a wash.
Would you mind formatting the benchmark results in a table/spreadsheet? One thing that struck me on a quick scan is that headerMapImplIterate/50 got noticeably worse, and that (I assume) would be used whenever we serialize headers to upstreams and downstreams. I realize this is a bit of toil, but it's not too bad if you use --benchmark_format=csv; then you can import that into a spreadsheet.
How can we reason about whether this is a net win?
Also should we fix the other benchmark in a separate PR so it can inform us of the impact? I'm sure there was a reason you picked that benchmark :)
I also want to point out (not having looked at code yet) that I would guess your change may offer a meaningful reduction in memory during operation.
I think that most (if not all) of the tests in the benchmark do not really exercise the trie. AFAIK the trie is used to quickly access the O(1) headers. Consider adding new targeted benchmark tests that exercise accessing different paths (for example, short and long keys).
Thinking about it, one possibility, if this turns out to be a noticeable additional time-cost for headers, might be to use distinct TrieEntry templates for headers vs. routes: the memory cost of the headers trie is ~fixed, relatively small, and (maybe) highly-trafficked, whereas the memory cost of the routes trie is up to 1.5GB in a real-world example, and that trie most likely only experiences one lookup per request.
I'll have a go at fixing that other benchmark and maybe adding more and putting together a side-by-side comparison (--benchmark_format=csv is a good tip I didn't know about, thanks!)
It'll take a while because this is mostly a side-task when I'm waiting for other things to build/test, and my current main task doesn't have very long pauses.
> Thinking one possibility if it turns out this is a noticeable additional time-cost for headers might be to use distinct TrieEntry templates for headers vs routes, because the memory cost of the headers trie is ~fixed and relatively small and (maybe) highly-trafficked, vs. the memory cost of the routes one is, well, up to 1.5GB in a real-world example and most likely only experiences one lookup per request.
+1 Changing the header-map has a bigger impact radius, so having 2 trie implementations seems safer.
> Also should we fix the other benchmark in a separate PR so it can inform us of the impact? I'm sure there was a reason you picked that benchmark :)
I had a go, and fixed the thrown exception, but there's just another segfault after that and I don't really know what that benchmark is even trying to achieve, so I'm abandoning that and making an issue for it (https://github.com/envoyproxy/envoy/issues/33794) in case someone else has an interest. I'll just make a standalone benchmark for trie performance.
I'm not entirely sure we should split out the header-map trie. Can we first see what's suboptimal about your trie for those header-map case and try to fix that? Maybe there's a way to have a single good trie implementation.
Failing that, we could split into two alternate trie implementations optimized for different purposes, well documented as to which one new code should choose, depending on whether its use is more like the header-map or more like routes.
In either case, can we have a new benchmark which shows the improvements you have made?
New benchmark (results in the PR description) suggests the new version is faster for everything but lookups of a pretty small number of small keys - I'm guessing because it makes for fewer cache-misses.
Can you show the benchmark results as a table where there's a column for 'before' and a column for 'after'? Like I said I know it's a bit of toil but not too bad if you use csv.
Did you do anything to make the Iterate benchmark faster in the new version rather than slower, which it was earlier? Maybe we are getting some inconsistent behavior on your machine. You can use --benchmark_xxx arguments of various sorts (try --help) to run more iterations and get statistically significant results.
Whenever doing these optimizations I do think it pays to run enough iterations to increase confidence.
/wait
Benchmarks are now side-by-side. It's a bit weird that headerMapImplIterate is the one benchmark with concerning numbers, because as far as I can tell it doesn't do anything involving a trie in the benchmark's measured code.
This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header-maps, and thus we should have 2 alternate impls with docs explaining when you'd pick which one.
One other minor request: can you add a percent-improvement column, and with negative percents when things get slower? It might be easier to see at a glance what got slower vs faster.
You might also want to use some more advanced --benchmark_xxx options to do a series of interleaved tests and report the std-deviation. It's just different options to pass to the test binary.
/wait
> This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header-maps and thus should have 2 alternate impls with docs explaining when you'd pick which one.
Funny, you were against that solution previously. :) I've done it this way because I tend to agree, though I think if the static header map grew just a little bit it would actually be faster to use the smaller trie for that too (and in real-world use that might already be the case, because the memory-cache-smashing impact in an isolated benchmark may not reflect the real-world performance impact of cache pressure).
> You might also want to use some more advanced --benchmark_xxx options to do a series of interleaved tests and report the std-deviation. It's just different options to pass to the test binary.
Made it a mean over 10 executions for the new benchmark; the std-dev was consistently small enough to not really be worth reporting. The old benchmark is essentially irrelevant now since the change isn't touching header behavior; I added one column to it with the same execution parameters to make sure it was like the "old" column (i.e. that the sugar-only changes were indeed sugar-only), which it is.
> This looks good for the routes. I am not entirely convinced that we aren't making things a bit worse for header-maps and thus should have 2 alternate impls with doc explaining when you'd pick which one.
>
> Funny, you were against that solution previously. :)
FWIW, it was I who was for introducing this in 2 steps (reducing blast radius). That said, if there's no impact (or positive impact) for the headers case, then I'd be ok with updating the trie for both.
My preference is to iterate on the performance for the header-map and use this new impl always. If you think we can get there, I'm all for it.
But I just had a hard time interpreting the data shown as proving it was always at least as fast as it was previously.
> My preference is to iterate on the performance for the header-map and use this new impl always. If you think we can get there, I'm all for it.
Yeah, I don't think there's a viable way to get the performance of FastTrieEntry with the implementation of SmallTrieEntry at the scale of the header static lookup table. (I do think it would be possible to get the header static lookup table to be faster and smaller than a trie, but that would be a separate PR since the point here was more to improve the size, and, it turns out, also performance, of the route-prefix trie.)
> But I just had a hard time interpreting the data shown as proving it was always at least as fast as it was previously.
As the benchmark now shows, it is not as fast for small datasets. I guess it might be useful to specifically run the benchmark with a dataset that is the common set of static headers. Not sure where that's defined.
OK, I will look at the current state. Header-map performance is important, so I have a hard time suggesting changes that might make someone's critical path slower.
Maybe not if it fits in a cache line?
(Quoted from the review thread on source/common/common/utility.h, https://github.com/envoyproxy/envoy/pull/33668#discussion_r1585174941:)

```diff
   Value value_{};
-  std::array<std::unique_ptr<TrieEntry>, 256> entries_;
+private:
+  std::array<std::unique_ptr<FastTrieEntry>, 256> entries_;
+};
+
+// A SmallTrieEntry aims to be a good balance of performant and
+// space-efficient, by allocating a vector the size of the range of children
+// the node contains. This should be good for most use-cases.
+//
+// For example, a node with children 'a' and 'z' will contain a vector of
+// size 26, containing two values and 24 nulls. A node with only one
```

Lookups through a btree would be multiple steps. O(log(num_keys)) is surely worse than O(1) when the 1 is also very simple.
> Maybe not if it fits in a cache line?
The vast majority of vector-based nodes are going to contain vectors of size 1, which (even with the overhead of also containing a start point and the vector's own overhead) would probably still end up smaller than an absl::btree of size 1. I guess I could give it a try after fixing this other stuff.
Performance tested a btree_map-based TrieEntry; it's similar to the stretching vector, but slightly slower throughout (and significantly slower at the small end).
I think it would probably make the most sense to land #33915 first, then #33932; then this one can be modified back to having only one implementation and one layer of template (renamed back to TrieEntry/TrieLookupTable), because BigTrieEntry will no longer have any customers. That will also make this PR touch far fewer files, since all the consumers of TrieLookupTable other than the static header map will be able (and appropriate) to use the new one with no changes at the client end.
> I think it would probably make the most sense to land #33915 first, then #33932, [...]
An added bonus of this approach: it will fix the msan/tsan problem of OOM-crashing when running the benchmark, because there will no longer be a BigTrieEntry to compare against. BigTrieEntry is what causes the OOM, which is essentially the thing this PR was intended to fix in the first place. :)
Now that the static header map search isn't a TrieLookupTable, this PR is much simpler because it's back to not needing to support two implementations (and also no longer needs to touch any client-code).
Ready for review again.
/retest
/retest
/retest