operations
Mismatch between CDN and backend traffic
I had a look at traffic incoming to the render servers and traffic leaving Fastly headed to the backend. There's a significant mismatch.
Upper is backend, lower is CDN. The queries graphed are
sum(rate(apache_sent_kilobytes_total{instance=~"(albi|bowser|necrosan|odin|pyrene|rhaegal|scorch|ysera)"}[$__rate_interval]))
and
sum(rate(fastly_rt_origin_fetch_resp_header_bytes_total{service_name=~"OSM Tiles CDN"}[$__rate_interval]))/1000 + sum(rate(fastly_rt_origin_fetch_resp_body_bytes_total{service_name=~"OSM Tiles CDN"}[$__rate_interval]))/1000
A similar difference is seen in the TPS.
Some requests are still coming in from the old CDN, but at most that is 100/s.
I know it's possible that there are users bypassing the CDN and directly going to the backend, but I would be surprised if there's enough for 1.5k TPS.
I did some digging on requests, looking at Fastly origin fetches, Apache accesses, and mod_tile HTTP responses:
Fastly origin fetches: sum(rate(fastly_rt_origin_fetches_total{service_name=~"OSM Tiles CDN"}[$__rate_interval]))
Apache accesses: sum(rate(apache_accesses_total{instance=~"(balerion|bowser|culebre|nidhogg|odin|pyrene|ysera)"}[$__rate_interval]))
mod_tile HTTP responses: sum(rate(modtile_http_response_total{instance=~"(balerion|bowser|culebre|nidhogg|odin|pyrene|ysera)"}[$__rate_interval]))
Apache accesses = 2730 TPS
mod_tile responses = 2500 TPS
This indicates significant non-tile traffic. Checking the logs for the same interval, there are about 200 TPS getting 429 responses from per-IP rate limiting. That traffic never reaches mod_tile, so it explains most of that difference.
At the same time, Fastly origin fetches = 1800 TPS.
Breaking down the mod_tile responses by status code gives:
HTTP 200 = 1750 TPS
HTTP 304 = 750 TPS
HTTP 404 = 0.1 TPS
It looks like fastly_rt_origin_fetches_total only counts retrievals of tiles, not conditional GETs answered with a 304. At the same time, there are still about 50 TPS unaccounted for there. Some of that is going to be Fastly health checks, but that's 1 request every 15 seconds per datacenter, or about 4 TPS.
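As a rough check on the health-check estimate, the arithmetic can be sketched in Python. The 15-second interval and the 4 TPS figure come from the paragraph above; the datacenter count is not stated anywhere in this issue, so the value below is only what those two numbers imply, and the function names are made up for illustration.

```python
# Health-check arithmetic, using only figures quoted in this issue.
HEALTH_CHECK_INTERVAL_S = 15  # one check every 15 seconds per datacenter


def health_check_tps(datacenters, interval_s=HEALTH_CHECK_INTERVAL_S):
    """Requests per second generated by periodic health checks."""
    return datacenters / interval_s


def implied_datacenters(tps, interval_s=HEALTH_CHECK_INTERVAL_S):
    """Datacenter count that would produce the given health-check rate."""
    return tps * interval_s


# Assumption: ~4 TPS of health checks, as estimated above.
datacenters = implied_datacenters(4)

# Gap between Fastly origin fetches (1800) and mod_tile 200s (1750),
# minus the share attributable to health checks.
unexplained_tps = (1800 - 1750) - health_check_tps(datacenters)

print(f"implied datacenters: {datacenters}")
print(f"still unexplained: {unexplained_tps} TPS")
```

Under those assumptions, health checks cover only a small part of the 50 TPS gap.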
I also looked at other Fastly metrics, and there's fastly_rt_origin_revalidations_total, which is about 50 TPS, so that's probably it.
However, I'm confused, because according to the Fastly RT API docs, origin_revalidations is supposed to count 304 status codes.
So to summarize,
modtile responses = 200s + 304s + 404s + 5xxs
apache accesses = 200s + 304s + 404s + 429s + 5xxs
fastly origin fetches + fastly origin revalidations = 200s
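To see how well these identities hold, here is a small Python sketch that plugs in the rounded rates quoted earlier in this issue. The variable names are made up for illustration, the 5xx rate was not measured above and is treated as zero, and the figures are approximate readings from graphs and logs, so small residuals are expected.

```python
# Approximate rates in TPS, copied from the figures quoted in this issue.
apache_accesses = 2730
modtile_responses = 2500
rate_limited_429 = 200
fastly_origin_fetches = 1800
fastly_origin_revalidations = 50
modtile_by_status = {200: 1750, 304: 750, 404: 0.1}  # 5xx assumed ~0


def gap(lhs, rhs):
    """Left side minus right side, rounded to 0.1 TPS."""
    return round(lhs - rhs, 1)


# mod_tile total vs. the sum of its per-status rates.
status_gap = gap(modtile_responses, sum(modtile_by_status.values()))

# Apache accesses vs. mod_tile responses plus rate-limited 429s.
apache_gap = gap(apache_accesses, modtile_responses + rate_limited_429)

# Third identity as written: fetches + revalidations vs. 200s.
fetch_plus_reval_gap = gap(
    fastly_origin_fetches + fastly_origin_revalidations, modtile_by_status[200]
)

# Alternative reading: fetches - revalidations vs. 200s.
fetch_minus_reval_gap = gap(
    fastly_origin_fetches - fastly_origin_revalidations, modtile_by_status[200]
)

for name, value in [
    ("mod_tile total vs statuses", status_gap),
    ("apache vs mod_tile + 429s", apache_gap),
    ("fetches + revalidations vs 200s", fetch_plus_reval_gap),
    ("fetches - revalidations vs 200s", fetch_minus_reval_gap),
]:
    print(f"{name}: {value:+} TPS")
```

With these rounded numbers, about 30 TPS of Apache traffic is left over after the 429s, and the Fastly side only balances against the 200s if revalidations are subtracted from fetches rather than added. That may just reflect rounding or how the counters are defined, which is the open question here.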