TDBv0.2: Cache background revalidation and eviction
Depends on https://github.com/tempesta-tech/tempesta/issues/1869
Scope
The `tfw_cache_mgr` thread must traverse the Web cache and either evict stale records under memory pressure or revalidate them otherwise. The thread must be accurately scheduled and throttled so as not to impact system performance, while still freeing the required memory efficiently. #500 must be kept in mind as well.
The revalidation logic is defined by RFC 7234 4.3 and requires the implementation of conditional requests.
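For illustration, a minimal sketch of what the revalidation step could look like, assuming hypothetical helpers `tfw_cache_entry_stale()`, `tfw_http_msg_alloc_req()`, `tfw_http_msg_add_hdr()`, `tfw_forward_req()` and a `TfwCacheEntry` carrying the stored validators (none of these are the existing Tempesta API):

```c
/*
 * Hypothetical sketch: revalidate a stale cache entry per RFC 7234 4.3
 * by building a conditional request from the stored validators.
 * None of these helpers exist in Tempesta under these names.
 */
static int
tfw_cache_revalidate(TfwCacheEntry *ce)
{
	TfwHttpReq *req;

	if (!tfw_cache_entry_stale(ce))
		return 0;	/* still fresh, nothing to do */

	req = tfw_http_msg_alloc_req(ce->uri);
	if (!req)
		return -ENOMEM;

	/* Prefer the strong validator when the origin provided one. */
	if (ce->etag)
		tfw_http_msg_add_hdr(req, "If-None-Match", ce->etag);
	else if (ce->last_modified)
		tfw_http_msg_add_hdr(req, "If-Modified-Since", ce->last_modified);

	/* A 304 response refreshes the entry; a 200 replaces its body. */
	return tfw_forward_req(req);
}
```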
Keep in mind the DoS attack from #520. The following items, linked with #516 (TDB v0.3), must be implemented:
- [ ] Revalidate cache entries by specified per-vhost timeout (like S3 lifecycle)
- [ ] TDB tables must be dynamically extensible and should not be strictly a power of 2 in size, e.g. 7GB should be fine. See comment https://github.com/tempesta-tech/tempesta/issues/1515#issuecomment-1835064500
- [ ] UPDATE and DELETE operators must be implemented. Probably the lock-free index should be immutable, deletion should be implemented using tombstones, and updates should be just copies of the data plus a tombstone for the old data.
- [ ] Properly implement the `reinsert` and `lookup & insert` (`tdb_rec_get_alloc()`) logic from #1115 (temporarily implemented in #1178).
- [ ] Race-free interface for large insertions (see the sketch after this list). E.g. `__cache_add_node()` creates a TDB entry, which immediately becomes visible to other threads, and later `tfw_cache_copy_resp()` inserts the actual data, so concurrent threads may get incomplete or corrupted data. It can be done in 2 phases (soft updates): (1) allocate space in the TDB data area and (2) do the actual insert (index update) to link the data. The `tfw_client_obtain()` modifications from #1178, as well as the similar HTTP sessions storage (#685), and `__cache_add_node()` must be changed to use the soft updates. This also implies some versioning: while a softirq is sending data for the current cached object (probably very slowly with #391.1 in mind), the object may become stale and/or be replaced by a new version, so new scans must fetch only the new version, while the old version must reside in TDB until it's fully transmitted, and then it should be evicted.
- [x] Support/fix constant address placement for small records, see https://github.com/tempesta-tech/tempesta/pull/1178#discussion_r256834550
- [x] Generic items removal. On removal the HTrie must be shrunk. With record locking and/or reference counting, probably tombstone removal should be implemented.
- [ ] There must be locks or reference counters for the stored entries so as not to delete entries being processed (see e.g. #522)
- [ ] A custom eviction strategy must be implemented (e.g. the Web cache should register its callbacks for freshness calculation) such that different tables can use different eviction strategies or no eviction at all. Custom triggers must be supported, e.g. the TLS cache should be able to specify the maximum number of stored sessions as 50 (see ssl_cache.c).
- [ ] Besides the creation timestamp for eviction, entries must have minimum and maximum lifetimes honored by the eviction strategy
- [x] The number of `memset()` calls must be reduced.
- [ ] Fix data persistency on clean restart. Introduce non-persistent tables: the sessions (#685) and client (#1115) tables should be non-persistent. _Probably for Beta we should go with non-persistent tables only (as for now). We definitely should have a configuration option for whether to read the full database into RAM on start or to just throw out all the data (or do it in the background for #516)._
- [ ] Web cache data for different vhosts must be stored in different tables to prevent full path collisions and improve concurrency and security (table separation plus tdbfs user/group access control instead of `chroot` isolation).
- [x] ~The current TDB table size maximum is 128GB, which is too small for the web cache on modern hardware~ This is the subject of #400
- [ ] At the moment we have a very limited number of tables, but we might need to scale to thousands of tables, e.g. for logging #537
- [ ] We need to create Tempesta DB tables at runtime (e.g. to reconfigure a hash table for a bot protection algorithm) and to load Tempesta Language (#102) scripts at runtime.
- [ ] Cache tables must be per-vhost to get rid of unnecessary contention and index splitting for different vhosts. This is important for the CDN use case. However, large tables still must be supported for single-resource cases.
- [ ] Avoid the `__cache_entry_size()` call, which introduces an extra response traversal. It seems we can just allocate new TDB data blocks and later reuse them if we have extra space, or just ignore the tail if it's unusable.
- [ ] Consider sending cached content as compound pages, just like high-speed NICs do (e.g. see the discussions in #447)
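As referenced in the race-free insertion item, here is a rough sketch of two-phase soft updates with tombstone deletion; the `TdbRec` layout and all `tdb_*` helpers are illustrative assumptions, not the current TDB interface:

```c
/* Illustrative record layout: the refcount guards readers against
 * premature reclamation of a record they are still transmitting. */
typedef struct {
	atomic_t	refcnt;
	unsigned long	key;
	size_t		len;
	char		data[];
} TdbRec;

static int
tdb_soft_insert(TDB *db, unsigned long key, const void *data, size_t len)
{
	/* Phase 1: fill the record in the data area; invisible to readers. */
	TdbRec *r = tdb_data_alloc(db, sizeof(*r) + len);

	if (!r)
		return -ENOMEM;
	r->key = key;
	r->len = len;
	memcpy(r->data, data, len);
	atomic_set(&r->refcnt, 1);

	/* Phase 2: publish via the index; readers only ever observe
	 * complete records. */
	return tdb_index_insert(db, key, r);
}

static void
tdb_soft_delete(TDB *db, unsigned long key)
{
	/* Mark the index slot with a tombstone instead of unlinking... */
	TdbRec *old = tdb_index_tombstone(db, key);

	/* ...and reclaim the data only after the last reader drops it. */
	if (old && atomic_dec_and_test(&old->refcnt))
		tdb_data_free(db, old);
}
```

An UPDATE then becomes a new insertion plus a tombstone for the old record, which also provides the versioning needed while a softirq slowly transmits an old version of the object.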
This task is required to fix #803.
UPD. Since filtering (#731) and QoS (#488) also require eviction, their job should be done in the `tdb_mgr` thread instead.
UPD. TDB was designed to provide access to stored data in a zero-copy fashion, such that a cached response body can be sent directly to a socket. This property imposed several design limitations and introduced many difficulties. However, with TLS we always have to copy data, so the TDB design can be significantly simplified by allowing copies. So this depends on #634.
Cache eviction
While CART is a well-known, good adaptive replacement algorithm, there are a number of caching algorithms based on machine learning which provide a much better cache hit ratio. See for example the survey and Cacheus. Some of the algorithms require access to columnar storage for statistics (a common practice in CDNs).
At least some interface for a user-space algorithm is required. Probably just CART with some weights, where the weights are loaded from user space into the kernel, would be enough.
The cache must implement per-vhost eviction strategies and space quotas to provide caching QoS for CDN cases. Probably 2-layer quotas are required to mitigate configuration issues from a poor Vary specification on the application side, which may consume too much space (linked with #733). Different eviction strategies are required to handle, e.g., chunks of live streams (huge data volume, outdated chunks removed immediately) and rarely updated web content like CSS (stale entries may be served).
It must be possible to 'lock' some records in evictable data sets (see #858 and #471).
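One possible shape for such an interface, purely illustrative: neither `TdbEvictionOps` nor `tdb_set_eviction_ops()` exists today, and `TdbRec` is the illustrative record from the sketch above. Each table registers its own weight calculation, an eviction veto for locked records, lifetime bounds, and a hard trigger such as a maximum record count:

```c
/* Hypothetical per-table eviction policy hooks. */
typedef struct {
	/* Lower weight = better eviction candidate, e.g. CART augmented
	 * with weights loaded from user space. */
	unsigned long	(*weight)(const TdbRec *rec);
	/* Veto hook: 'locked' records (#858, #471) are never evicted. */
	bool		(*evictable)(const TdbRec *rec);
	/* Entries live at least min_lifetime and at most max_lifetime
	 * seconds, regardless of their weight. */
	unsigned long	min_lifetime;
	unsigned long	max_lifetime;
	/* Hard trigger, e.g. at most 50 stored TLS sessions. */
	size_t		max_records;
} TdbEvictionOps;

int tdb_set_eviction_ops(TDB *db, const TdbEvictionOps *ops);
```

With per-table hooks like these, live-stream chunks, CSS, and TLS sessions could follow different policies within the same `tdb_mgr` thread.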
Purging
Once this feature is implemented, we should be able to update site content normally, without a Tempesta restart or memory leaks. It's hard to track which new pages appeared and which were deleted during a site content update, so in this task we need:
- full web content purging;
- regular expression purging, e.g. `/foo/*.php` or `/foo/bar/*` (a toy matching sketch follows this list);
- ~`immediate` (`purge` in original #501) strategy for purging (we still need the mode to leave stale responses in the cache for #522);~ Done in #2074
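A toy sketch of the `*` semantics the purge patterns above imply; `purge_glob_match()` is a hypothetical helper, and the real implementation would walk the cache index rather than test a single URI:

```c
#include <stdbool.h>

/* Match a purge pattern where '*' stands for any run of characters,
 * so "/foo/*.php" matches "/foo/bar.php" and "/foo/bar/*" matches
 * everything under /foo/bar/. */
static bool
purge_glob_match(const char *pat, const char *uri)
{
	for (; *pat; ++pat, ++uri) {
		if (*pat == '*') {
			while (*++pat == '*')
				;
			if (!*pat)
				return true;	/* trailing '*' matches the rest */
			for (; *uri; ++uri)
				if (purge_glob_match(pat, uri))
					return true;
			return false;
		}
		if (*uri != *pat)
			return false;
	}
	return !*uri;	/* match only if both are fully consumed */
}
```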
Documentation
Need to update the https://github.com/tempesta-tech/tempesta/wiki/Caching-Responses#manual-cache-purging wiki page.
Testing
- [ ] Throughput on large cached objects, compared with Nginx
- [ ] Web content purging with the `invalidate` and `immediate` strategies
- [ ] Test on a web cache larger than 4GB on 1 and 2 NUMA nodes with cache modes 1 and 2.
It seems there is some race in the lock-free index, or we actually hit the https://github.com/tempesta-tech/tempesta/issues/500 problem in the scenario from #1435: multiple parallel requests to a large file
```
./wrk -d 3600 -c 16000 -t 8 -H 'connection: close' https://debian:443/research/web_acceleration_mechanics.pdf
```
combined with the Tempesta restart in the VM
```
# while :; do ./scripts/tempesta.sh --restart; sleep 30; done
```
sometimes produce warnings like
```
[ 1103.775556] [tdb] ERROR: out of free space
[ 1103.810415] [tdb] ERROR: out of free space
[ 1103.845177] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1103.929897] [tdb] ERROR: out of free space
[ 1103.949002] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1103.984315] [tdb] ERROR: out of free space
[ 1104.010543] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1104.070816] [tdb] ERROR: out of free space
[ 1104.080997] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1104.151540] [tdb] ERROR: out of free space
[ 1104.158845] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
[ 1104.199489] [tdb] ERROR: out of free space
[ 1104.231891] [tdb] ERROR: Cannot allocate cache entry for key=0x37bed983985f3ea7
....
```
The task must be split. After #788, the most crucial part is removing cache entries for #522 plus some basic eviction to get the cache usable, i.e. to get rid of the memory leaking.
I've made a few rough benchmarks of HTTP/2 with caching enabled:
```
h2load -c700 -m100 --duration=30 -t2 https://debian
```
Tempesta
1kb response
```
finished in 30.14s, 337279.80 req/s, 393.06MB/s
requests: 10118394 total, 10188394 started, 10118394 done, 10118394 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 10118394 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 11.52GB (12364696856) total, 1.70GB (1821310920) headers (space savings 23.08%), 9.65GB (10361235456) data
                  min        max        mean       sd        +/- sd
time for request: 391us      404.11ms   69.33ms    52.31ms   64.69%
time for connect: 70.24ms    229.04ms   169.16ms   56.50ms   61.71%
time to 1st byte: 195.61ms   323.51ms   252.20ms   27.06ms   79.96%
req/s           : 0.00       4462.36    803.41     771.99    59.29%
```
5kb response
```
finished in 30.23s, 229514.40 req/s, 1.14GB/s
requests: 6885532 total, 6955433 started, 6885532 done, 6885432 succeeded, 100 failed, 100 errored, 0 timeout
status codes: 6885469 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 34.16GB (36684160200) total, 1.16GB (1244661614) headers (space savings 23.00%), 32.83GB (35253572326) data
                  min        max        mean       sd        +/- sd
time for request: 17.12ms    698.47ms   103.21ms   39.88ms   90.88%
time for connect: 73.25ms    237.29ms   165.14ms   56.21ms   69.57%
time to 1st byte: 210.69ms   299.74ms   253.76ms   25.23ms   58.53%
req/s           : 0.00       603.40     366.27     247.73    69.86%
```
128kb response
```
finished in 30.36s, 17200.80 req/s, 2.11GB/s
requests: 516024 total, 586024 started, 516024 done, 516024 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 516273 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 63.24GB (67904607399) total, 90.50MB (94901121) headers (space savings 22.71%), 63.01GB (67651755146) data
                  min        max        mean       sd        +/- sd
time for request: 47.50ms    18.31s     998.10ms   1.12s     95.44%
time for connect: 70.58ms    254.74ms   159.74ms   56.57ms   68.43%
time to 1st byte: 203.41ms   474.57ms   360.97ms   78.33ms   58.21%
req/s           : 0.00       181.65     31.60      47.24     77.14%
```
128kb response with HTTP/1
```
finished in 30.37s, 21665.00 req/s, 2.65GB/s
requests: 649950 total, 719750 started, 649950 done, 649950 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 650181 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 79.52GB (85388074799) total, 142.43MB (149350032) headers (space savings 0.00%), 79.95GB (85844417328) data
                  min        max        mean       sd        +/- sd
time for request: 27.77ms    2.64s      510.89ms   293.07ms  85.45%
time for connect: 76.97ms    210.16ms   152.70ms   47.93ms   69.34%
time to 1st byte: 187.62ms   302.22ms   253.48ms   39.83ms   54.62%
req/s           : 0.00       336.64     48.35      78.75     82.86%
```
Nginx (nginx/1.23.3)
1kb response
```
finished in 30.15s, 135510.73 req/s, 150.56MB/s
requests: 4065322 total, 4135322 started, 4065322 done, 4065322 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 4065322 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 4.41GB (4736134430) total, 476.87MB (500034606) headers (space savings 33.15%), 3.88GB (4162889728) data
                  min        max        mean       sd        +/- sd
time for request: 1.45ms     1.54s      530.87ms   307.86ms  70.73%
time for connect: 15.54ms    374.44ms   123.50ms   85.68ms   77.57%
time to 1st byte: 179.61ms   909.80ms   359.37ms   165.22ms  86.00%
req/s           : 109.97     366.27     193.44     80.16     71.71%
```
5kb response
```
finished in 30.16s, 168594.90 req/s, 846.10MB/s
requests: 5057847 total, 5127847 started, 5057847 done, 5057847 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 5065270 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 24.79GB (26616104602) total, 599.00MB (628093480) headers (space savings 32.97%), 24.12GB (25896832020) data
                  min        max        mean       sd        +/- sd
time for request: 359us      5.39s      432.35ms   460.44ms  87.07%
time for connect: 22.18ms    265.32ms   123.70ms   63.49ms   57.29%
time to 1st byte: 219.39ms   2.17s      803.55ms   511.62ms  59.57%
req/s           : 55.85      558.71     240.58     163.94    72.29%
```
128kb response
```
finished in 30.27s, 16222.27 req/s, 2.05GB/s
requests: 486668 total, 556668 started, 486668 done, 486668 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 548023 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 61.56GB (66099265904) total, 65.85MB (69050898) headers (space savings 32.98%), 61.42GB (65952787645) data
                  min        max        mean       sd        +/- sd
time for request: 21.49ms    29.62s     3.73s      3.07s     71.63%
time for connect: 23.21ms    310.06ms   147.42ms   71.60ms   57.86%
time to 1st byte: 247.08ms   1.68s      754.43ms   418.40ms  52.57%
req/s           : 3.10       175.05     23.13      21.80     88.00%
```
FYI: sometimes h2load freezes at the end of benchmarking Tempesta. It looks like Tempesta holds the connection.