
Mismatch in cdn and backend traffic

Open pnorman opened this issue 3 years ago • 1 comments

I compared the traffic arriving at the render servers with the traffic leaving Fastly for the backend. There's a significant mismatch.

[image] Upper is backend, lower is CDN.

The backend graph is

sum(rate(apache_sent_kilobytes_total{instance=~"(albi|bowser|necrosan|odin|pyrene|rhaegal|scorch|ysera)"}[$__rate_interval]))

and the CDN graph is

sum(rate(fastly_rt_origin_fetch_resp_header_bytes_total{service_name=~"OSM Tiles CDN"}[$__rate_interval]))/1000 + sum(rate(fastly_rt_origin_fetch_resp_body_bytes_total{service_name=~"OSM Tiles CDN"}[$__rate_interval]))/1000
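The CDN query divides Fastly's byte counters by 1000 so the result is in the same unit as Apache's kilobyte counter. A minimal Python sketch of that conversion and of a mismatch ratio follows. The function names and the sample rates are hypothetical and are not read from the graphs in this issue.

```python
def fastly_origin_kb_per_s(header_bytes_per_s: float, body_bytes_per_s: float) -> float:
    """Mirror the CDN query: header bytes/1000 plus body bytes/1000."""
    return header_bytes_per_s / 1000 + body_bytes_per_s / 1000


def mismatch_ratio(backend_kb_per_s: float, cdn_kb_per_s: float) -> float:
    """How many times larger the backend send rate is than the CDN origin-fetch rate."""
    if cdn_kb_per_s <= 0:
        raise ValueError("cdn_kb_per_s must be positive")
    return backend_kb_per_s / cdn_kb_per_s


# Hypothetical sample rates for illustration only, NOT values from this issue.
cdn = fastly_origin_kb_per_s(header_bytes_per_s=500_000, body_bytes_per_s=19_500_000)
ratio = mismatch_ratio(backend_kb_per_s=40_000, cdn_kb_per_s=cdn)
print(f"CDN origin fetches: {cdn} kB/s, backend/CDN ratio: {ratio}")
```

A ratio close to 1 would mean the two sides agree; anything well above 1 is traffic the backend sends that the CDN metrics do not count.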

A similar difference shows up in the TPS.

[image]

Some requests are still coming in from the old CDN, but that is at most 100/s.

I know it's possible that there are users bypassing the CDN and directly going to the backend, but I would be surprised if there's enough for 1.5k TPS.

pnorman, Jan 02 '21 19:01

I did some digging on requests, looking at three rates.

Fastly origin fetches: sum(rate(fastly_rt_origin_fetches_total{service_name=~"OSM Tiles CDN"}[$__rate_interval]))

Apache accesses: sum(rate(apache_accesses_total{instance=~"(balerion|bowser|culebre|nidhogg|odin|pyrene|ysera)"}[$__rate_interval]))

modtile HTTP responses: sum(rate(modtile_http_response_total{instance=~"(balerion|bowser|culebre|nidhogg|odin|pyrene|ysera)"}[$__rate_interval]))

Apache requests = 2730 TPS
modtile responses = 2500 TPS

This indicates significant non-tile traffic. Checking the logs for the same interval, about 200 TPS are getting 429 error codes from per-IP rate limiting. That traffic never reaches mod_tile, which accounts for most of the difference.
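As a quick check of the arithmetic, here is a small Python sketch using the approximate rates quoted in this comment. The variable names are just for illustration.

```python
# Approximate rates quoted in this comment, in requests per second (TPS).
apache_tps = 2730
modtile_tps = 2500
ratelimited_429_tps = 200

# Requests Apache sees that never show up as mod_tile responses.
gap = apache_tps - modtile_tps
# What is left once the rate-limited 429 responses are subtracted.
unexplained = gap - ratelimited_429_tps

print(f"Apache minus mod_tile: {gap} TPS")
print(f"Left after subtracting 429s: {unexplained} TPS")
```

With these rounded figures, the 429s cover 200 of the 230 TPS gap, leaving roughly 30 TPS of other non-tile requests.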

At the same time, Fastly origin fetches = 1800 TPS.

Breaking down the mod_tile responses by status code gives:

HTTP 200 = 1750 TPS
HTTP 304 = 750 TPS
HTTP 404 = 0.1 TPS

It looks like fastly_rt_origin_fetches_total only counts retrievals of tiles, not conditional GETs answered with a 304. Even so, about 50 TPS are still unaccounted for. Some of that will be Fastly health checks, but those are 1 request every 15 seconds per datacenter, or about 4 TPS.
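The numbers above can be lined up in a short Python sketch. The rates are the approximate ones from this comment. The datacenter count of 60 is an assumption: it is not stated here and is only what 4 TPS at one check per 15 seconds would imply.

```python
# Approximate rates quoted in this comment, in requests per second (TPS).
modtile_by_status = {200: 1750, 304: 750, 404: 0.1}
fastly_origin_fetches = 1800

# Origin fetches that are not matched by a mod_tile HTTP 200.
unaccounted = fastly_origin_fetches - modtile_by_status[200]


def health_check_tps(datacenters: int, interval_s: float = 15.0) -> float:
    """Aggregate health-check rate if each datacenter sends one request per interval."""
    return datacenters / interval_s


# ASSUMPTION: about 60 datacenters, inferred from 4 TPS * 15 s, not given in the post.
checks = health_check_tps(60)

print(f"Unaccounted origin fetches: {unaccounted} TPS")
print(f"Health checks: {checks} TPS, leaving {unaccounted - checks} TPS")
```

Under that assumption, health checks cover only a small part of the 50 TPS, so most of it has to come from somewhere else.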

I also looked at other Fastly metrics, and there's fastly_rt_origin_revalidations_total, which is about 50 TPS, so that's probably it.

However, I'm confused, because according to the Fastly RT API docs, origin_revalidations is supposed to count 304 status codes.

So to summarize:

modtile responses = 200s + 304s + 404s + 5xxs
apache accesses = 200s + 304s + 404s + 429s + 5xxs
fastly origin fetches + fastly origin revalidations = 200s

pnorman, Jul 09 '22 06:07