High cache disk write
I'm trying to set up an ATS reverse proxy cache in front of our services. I noticed that ATS writes a huge amount of data to disk even after requesting a resource that results in a TCP_REFRESH_HIT cache state. It looks like it rewrites the whole cache storage.
Environment
- Ubuntu 22.04 based docker image
- ATS 9.1.2
- 100 GB cache storage file (same thing happens when using a raw disk)
Attached is the access log; only the health check endpoint was requested before/after index.html:
id=firewall time="2022-08-03 10:27:03" fw=8e51c4744445 pri=6 proto=http duration=0.000 sent=198 rcvd=106 src=127.0.0.1 dst=0 dstname=localhost user=- op=GET arg="_ats_hc" result=200 ref="-" agent="curl/7.81.0" cache=TCP_MISS
id=firewall time="2022-08-03 10:27:33" fw=8e51c4744445 pri=6 proto=http duration=0.000 sent=198 rcvd=106 src=127.0.0.1 dst=0 dstname=localhost user=- op=GET arg="_ats_hc" result=200 ref="-" agent="curl/7.81.0" cache=TCP_MISS
id=firewall time="2022-08-03 10:27:48" fw=8e51c4744445 pri=6 proto=http duration=0.080 sent=2261 rcvd=225 src=172.22.18.3 dst=[x.x.x.x] dstname=[origin] user=- op=GET arg="index.html" result=200 ref="-" agent="curl/7.68.0" cache=TCP_REFRESH_HIT
id=firewall time="2022-08-03 10:28:03" fw=8e51c4744445 pri=6 proto=http duration=0.000 sent=198 rcvd=106 src=127.0.0.1 dst=0 dstname=localhost user=- op=GET arg="_ats_hc" result=200 ref="-" agent="curl/7.81.0" cache=TCP_MISS
id=firewall time="2022-08-03 10:28:33" fw=8e51c4744445 pri=6 proto=http duration=0.000 sent=198 rcvd=106 src=127.0.0.1 dst=0 dstname=localhost user=- op=GET arg="_ats_hc" result=200 ref="-" agent="curl/7.81.0" cache=TCP_MISS
The cache only contained index.html; however, it seems ATS consumes an extra 8192 bytes of cache storage after each request (there is no other alternate in the cache).
"proxy.process.cache.read_per_sec": "0.000000",
"proxy.process.cache.write_per_sec": "2.000400",
"proxy.process.cache.KB_read_per_sec": "0.000000",
"proxy.process.cache.KB_write_per_sec": "4096.819336",
"proxy.process.cache.bytes_used": "16384",
"proxy.process.cache.bytes_total": "107240243200",
"proxy.process.cache.ram_cache.total_bytes": "133906432",
"proxy.process.cache.ram_cache.bytes_used": "8192",
"proxy.process.cache.ram_cache.hits": "1",
"proxy.process.cache.ram_cache.misses": "1",
"proxy.process.cache.volume_0.bytes_used": "16384",
"proxy.process.cache.volume_0.bytes_total": "107240243200",
"proxy.process.cache.volume_0.ram_cache.total_bytes": "133906432",
"proxy.process.cache.volume_0.ram_cache.bytes_used": "8192",
"proxy.process.cache.volume_0.ram_cache.hits": "1",
"proxy.process.cache.volume_0.ram_cache.misses": "1",
The disk write starts ~30-60 seconds later.
Is this intended, or is something wrong with the environment/configuration?
Thanks
@baricsz Can you post the headers from the origin response?
@bryancall These are the headers from the origin
id=firewall time="2022-08-10 06:22:49" fw=1c41cf083ce4 pri=6 proto=http duration=0.046 sent=2261 rcvd=225 src=172.22.18.3 dst=[x.x.x.x] dstname=[origin] user=- op=GET arg="index.html" result=200 ref="-" agent="curl/7.68.0" cache=TCP_REFRESH_HIT origin_headers="{{{Server}:{nginx}}{{Date}:{Wed, 10 Aug 2022 06:22:49 GMT}}{{Connection}:{keep-alive}}{{Access-Control-Allow-Origin}:{*}}{{Access-Control-Allow-Methods}:{POST, GET, OPTIONS, DELETE, PUT, HEAD}}{{Access-Control-Max-Age}:{3600}}{{Access-Control-Allow-Headers}:{x-requested-with, authorization, Content-Type, Authorization, credential, X-XSRF-TOKEN, if-modified-since, if-none-match, x-webhook-apikey}}{{X-Content-Type-Options}:{nosniff}}{{X-XSS-Protection}:{1; mode=block}}{{Cache-Control}:{no-cache, must-revalidate, public}}{{Last-Modified}:{Thu, 04 Aug 2022 07:53:09 GMT}}{{Strict-Transport-Security}:{max-age=31536000; includeSubDomains; preload}}{{Referrer-Policy}:{strict-origin-when-cross-origin}}}"
Server: nginx
Date: Wed, 10 Aug 2022 06:22:49 GMT
Connection: keep-alive
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST, GET, OPTIONS, DELETE, PUT, HEAD
Access-Control-Max-Age: 3600
Access-Control-Allow-Headers: x-requested-with, authorization, Content-Type, Authorization, credential, X-XSRF-TOKEN, if-modified-since, if-none-match, x-webhook-apikey
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Cache-Control: no-cache, must-revalidate, public
Last-Modified: Thu, 04 Aug 2022 07:53:09 GMT
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Referrer-Policy: strict-origin-when-cross-origin
Also, proxy.config.http.cache.ignore_server_no_cache INT 1 is set in the records.config file.
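For reference, the full line in records.config (ATS 9.x syntax) would presumably be:
# Ignore no-cache directives sent by the origin server.
CONFIG proxy.config.http.cache.ignore_server_no_cache INT 1
This is why the object is cached at all despite the Cache-Control header above.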
@bryancall Do you have any thoughts on this?
Cache-Control: no-cache, must-revalidate, public
--> the origin is telling ATS that it can't serve this from cache without revalidating with the origin each time. I'd suggest asking them to remove no-cache (and to really think through must-revalidate for the momentarily-stale case), and to add a max-age directive.
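If the origin team agrees, a sketch of what that could look like on their nginx side (illustrative only; the location block and the 60-second max-age are assumptions, not taken from their config):
location = /index.html {
    # Short max-age lets ATS serve from cache briefly between revalidations.
    add_header Cache-Control "public, max-age=60, must-revalidate" always;
}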
I think the Cache-Control is correct; at least, this specific resource must be revalidated each time.
My only concern is that even when the resource has been revalidated and the origin server says it is still valid with nothing to download (304 Not Modified, no response body), ATS consumes an extra 8192-byte block from the cache storage and also writes a huge amount of data to the disk.
Origin access log
161.35.24.62 - - [20/Sep/2022:06:26:40 +0000] "GET /path/x.js HTTP/1.1" 304 0 "https://x.y.z/path/index.html" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36" "x.x.x.x, x.x.x.y" rt=0.018 uct=0.000 uht=0.017 urt=0.017
More precisely, it writes around 500 MB to the cache storage after this specific request, and I'm wondering whether this is normal and won't affect server performance under high load.
The extra storage for each hit is expected. This is because:
- As far as ATS is concerned all of the cache storage is in use all the time. It is 100% full at all times because it is treated as a circular buffer.
- Each cache object has a "first doc" which contains metadata for the object (and sometimes the object also, if it's small). However, nothing in the cache is ever updated. What appears to be an update is really a new write to the cache. E.g. if the object is in cache and the metadata needs to be updated, a new copy of the metadata ("first doc") is written to the cache. This is very likely what you see as the "extra storage" used.
I also suspect the 500 MB write to cache storage is stripe directory synchronization. Each stripe of cache storage has a directory which locates objects in the cache, and two copies of it are written to disk. The cache architecture documentation might be of interest, particularly the storage layout. To recover from crashes with minimal cache update loss, the stripe directory is written out every minute or so. Unfortunately, the configuration variable controlling this does not appear to be documented:
proxy.config.cache.dir.sync_frequency - time in seconds between stripe directory writes. The default is 60 seconds. I will try to update the documentation.
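If the periodic burst is a concern, the interval can be stretched in records.config, trading fewer (but larger) directory writes against more potential cache-state loss on a crash. An illustrative line (the 120-second value is an assumption, not a recommendation):
# Sync stripe directories every 120 s instead of the default 60 s.
CONFIG proxy.config.cache.dir.sync_frequency INT 120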
Thanks for the detailed explanation, much appreciated!