cortex
cortex copied to clipboard
Distributor instrumentation reports "success" to requests returning 4xx when called via HTTPgRPC
Describe the bug
In the distributor, the metric cortex_request_duration_seconds
is counting 4xx as status_code="success"
when the distributor Push()
return an httpgrpc error with a 4xx code inside.
From a quick look, I believe the problem is in the httpgrpc server where only 5xx are converted into httpgprc errors: https://github.com/weaveworks/common/blob/bd288de53d57de300fa286688ce2fc935687213f/httpgrpc/server/server.go#L67-L69
The grpc instrumentation then report everything which is not an error as success
:
https://github.com/weaveworks/common/blob/bd288de53d57de300fa286688ce2fc935687213f/middleware/grpc_instrumentation.go#L14-L26
Expected behavior
To be reported as status_code=4xx
.
@bboreham What's the sentiment if we would change https://github.com/weaveworks/common/blob/bd288de53d57de300fa286688ce2fc935687213f/httpgrpc/server/server.go#L67-L69 to return an error also in case of a 4xx?
I don't use httpgrpc for calls into the distributor, so don't care what you do in that regard. I guess we'd need to think about what happens in the places httpgrpc is used unconditionally - does query-worker go through this path?
Would it cause increased logging?
This was changed in https://github.com/cortexproject/cortex/commit/f807690176ec98508a03c5ba29cbd8740c309ebd#diff-e58545b15a7caeb6b1290e4a6f2f41825141a8bfe028afe1736cfe9d2e4ac4f5R69, copied across to https://github.com/weaveworks/common/pull/117/files#diff-3e32bcfdd79c52ac49dbb25b172d9e2d3010cec4078f6f1748e4d588114d974dR69
So I guess this had something to do with the desired behaviour on retries. We should figure out what that was before changing it back.
Another way of doing this would be to modify gRPC instrumentation at https://github.com/weaveworks/common/blob/bd288de53d57de300fa286688ce2fc935687213f/middleware/grpc_instrumentation.go#L14-L26
and also inspect the response we’re sending back. If it is *httpgrpc.HTTPResponse
, then we could extract status code from it, and report that in metrics. (Introducing an interface for getting status code may be a better option)
Note what I said here:
copied across to https://github.com/weaveworks/common/pull/117/files#diff-3e32bcfdd79c52ac49dbb25b172d9e2d3010cec4078f6f1748e4d588114d974dR69 was a misunderstanding. That
err
was always nil.
That came in at https://github.com/weaveworks/common/pull/36, which was a fix to https://github.com/weaveworks/common/issues/35. That now has an explanation, which I believe was outdated by weaveworks/common#40.