Querier tracks cortex_request_duration_seconds_count with status_code=503 on "context canceled"
I'm investigating the alert MimirRequestErrors which fired for a short period on a Mimir cluster because the querier tracked cortex_request_duration_seconds_count metric with status_code="503" and route="prometheus_api_v1_query_range".
Looking at querier logs I can see the rate of logs with context canceled matching the rate of requests tracked with status code 503. An example of querier log:
msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = context canceled"
This makes me suspect that the querier is incorrectly tracking cortex_request_duration_seconds_count with status_code="503" when the request is actually canceled (or at least its context is canceled).
To double check, I compared the request failures reported by query-frontend and querier. The query-frontend shows no 5xx responses.

The issue comes from the error mapping done in Prometheus API, specifically here: https://github.com/prometheus/prometheus/blob/ae597cac62425f0d3017860af6be5f65666c9e93/web/api/v1/api.go#L1602-L1603
When the context is canceled while the PromQL engine is running, the HTTP response status code returned from querier to query-frontend is 503, and the request is therefore tracked as 503 in the querier's cortex_request_duration_seconds_count.
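To make the mapping concrete, here is a simplified sketch of the behavior described above. It is not a copy of the Prometheus code: the type and function names are illustrative, the mappings other than "canceled" are only there for context, and 499 (the non-standard "client closed request" code popularized by nginx) is an assumed replacement value.

```go
package main

import "net/http"

// errorType is a hypothetical stand-in for the API's internal error categories.
type errorType string

const (
	errorTimeout  errorType = "timeout"
	errorCanceled errorType = "canceled"
	errorExec     errorType = "execution"
	errorBadData  errorType = "bad_data"
)

// statusClientClosedRequest is the non-standard 499 code.
// Assumption: used here only to illustrate a non-5xx mapping for cancellations.
const statusClientClosedRequest = 499

// statusBefore models the problematic mapping: a canceled context is
// reported as 503, exactly like a timeout. Other cases are illustrative.
func statusBefore(t errorType) int {
	switch t {
	case errorBadData:
		return http.StatusBadRequest
	case errorExec:
		return http.StatusUnprocessableEntity
	case errorCanceled, errorTimeout:
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}

// statusAfter models the proposed behavior: cancellations get their own
// non-5xx code, everything else keeps the previous mapping.
func statusAfter(t errorType) int {
	if t == errorCanceled {
		return statusClientClosedRequest
	}
	return statusBefore(t)
}
```

With this split, a canceled query no longer lands in the 5xx bucket that the alert watches, while timeouts still do.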
I see 3 options to solve this problem (in order of my preference):
1. Easy: change the status code in Prometheus (proposed here)
2. Easy: if (1) is discarded, change the status code in our Prometheus fork
3. Harder: do a remapping in Mimir. This option is harder because Mimir just reuses the Prometheus API, so we would have to do the remapping in an HTTP middleware, which means (a) detecting the actual cause of the error and (b) rewriting the HTTP response status code. Both are harder than just changing the status code in Prometheus.
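For comparison, the Mimir-side remapping option could look roughly like the middleware below. This is a minimal sketch, not Mimir code: `remapWriter`, `remapCanceled` and `runExample` are hypothetical names, 499 is an assumed target status, and step (a) is approximated by checking whether the request context was canceled, which will not identify every real cause of a 503.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
)

// Assumption: 499 is the status we want to report for canceled requests.
const statusClientClosedRequest = 499

// remapWriter intercepts WriteHeader so the status can be rewritten
// before it reaches the real ResponseWriter.
type remapWriter struct {
	http.ResponseWriter
	ctx         context.Context
	wroteHeader bool
}

func (w *remapWriter) WriteHeader(code int) {
	if w.wroteHeader {
		return
	}
	w.wroteHeader = true
	// (a) detect the cause: only the request context is inspected here.
	// (b) remap: a 503 on a canceled request becomes 499.
	if code == http.StatusServiceUnavailable && errors.Is(w.ctx.Err(), context.Canceled) {
		code = statusClientClosedRequest
	}
	w.ResponseWriter.WriteHeader(code)
}

func (w *remapWriter) Write(b []byte) (int, error) {
	if !w.wroteHeader {
		w.WriteHeader(http.StatusOK)
	}
	return w.ResponseWriter.Write(b)
}

// remapCanceled wraps a handler with the status remapping.
func remapCanceled(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(&remapWriter{ResponseWriter: w, ctx: r.Context()}, r)
	})
}

// runExample sends one request through the middleware and returns the
// recorded status. The inner handler always answers 503 and stands in
// for the reused Prometheus API handler.
func runExample(cancelRequest bool) int {
	inner := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelRequest {
		cancel() // simulate the caller giving up on the query
	}
	req := httptest.NewRequest(http.MethodGet, "/api/v1/query_range", nil).WithContext(ctx)
	rec := httptest.NewRecorder()
	remapCanceled(inner).ServeHTTP(rec, req)
	return rec.Code
}
```

Calling `runExample(true)` from a `main` function or a test exercises the canceled path, and `runExample(false)` shows that a 503 on a live request passes through unchanged. Even in this small form the wrapper has to track header state and guess the cause from the context, which illustrates why changing the status code in Prometheus is the simpler route.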
Easy: change the status code in Prometheus
The change has been merged in Prometheus (thanks!), so I'm now pulling upstream Prometheus into our fork (PR) and will then update the vendored Prometheus in Mimir.