cortex icon indicating copy to clipboard operation
cortex copied to clipboard

Distributor instrumentation reports "success" to requests returning 4xx when called via HTTPgRPC

Open pracucci opened this issue 3 years ago • 5 comments

Describe the bug In the distributor, the metric cortex_request_duration_seconds is counting 4xx as status_code="success" when the distributor Push() return an httpgrpc error with a 4xx code inside.

From a quick look, I believe the problem is in the httpgrpc server where only 5xx are converted into httpgprc errors: https://github.com/weaveworks/common/blob/bd288de53d57de300fa286688ce2fc935687213f/httpgrpc/server/server.go#L67-L69

The grpc instrumentation then report everything which is not an error as success: https://github.com/weaveworks/common/blob/bd288de53d57de300fa286688ce2fc935687213f/middleware/grpc_instrumentation.go#L14-L26

Expected behavior To be reported as status_code=4xx.

pracucci avatar Mar 12 '21 17:03 pracucci

@bboreham What's the sentiment if we would change https://github.com/weaveworks/common/blob/bd288de53d57de300fa286688ce2fc935687213f/httpgrpc/server/server.go#L67-L69 to return an error also in case of a 4xx?

pracucci avatar Mar 12 '21 17:03 pracucci

I don't use httpgrpc for calls into the distributor, so don't care what you do in that regard. I guess we'd need to think about what happens in the places httpgrpc is used unconditionally - does query-worker go through this path?

Would it cause increased logging?

bboreham avatar Mar 12 '21 17:03 bboreham

This was changed in https://github.com/cortexproject/cortex/commit/f807690176ec98508a03c5ba29cbd8740c309ebd#diff-e58545b15a7caeb6b1290e4a6f2f41825141a8bfe028afe1736cfe9d2e4ac4f5R69, copied across to https://github.com/weaveworks/common/pull/117/files#diff-3e32bcfdd79c52ac49dbb25b172d9e2d3010cec4078f6f1748e4d588114d974dR69

So I guess this had something to do with the desired behaviour on retries. We should figure out what that was before changing it back.

bboreham avatar Sep 23 '21 14:09 bboreham

Another way of doing this would be to modify gRPC instrumentation at https://github.com/weaveworks/common/blob/bd288de53d57de300fa286688ce2fc935687213f/middleware/grpc_instrumentation.go#L14-L26 and also inspect the response we’re sending back. If it is *httpgrpc.HTTPResponse, then we could extract status code from it, and report that in metrics. (Introducing an interface for getting status code may be a better option)

pstibrany avatar May 06 '22 07:05 pstibrany

Note what I said here:

copied across to https://github.com/weaveworks/common/pull/117/files#diff-3e32bcfdd79c52ac49dbb25b172d9e2d3010cec4078f6f1748e4d588114d974dR69 was a misunderstanding. That err was always nil.

That came in at https://github.com/weaveworks/common/pull/36, which was a fix to https://github.com/weaveworks/common/issues/35. That now has an explanation, which I believe was outdated by weaveworks/common#40.

bboreham avatar May 15 '22 20:05 bboreham