cortex icon indicating copy to clipboard operation
cortex copied to clipboard

Don't log "repeated timestamp but different value" message as error/warning

Open weeco opened this issue 6 years ago • 11 comments

The problem:

When you receive multiple values for the same metric series at the same timestamp this will be rejected with a 400 error. This is a desired behaviour, however it also logs every occurance as error log (in the distributor) and at the warn level in the ingester.

Warn and error logs should only indicate problems. Under certain circumstances you cannot fix the root cause of this (fixing the reported metrics) and therefore this can quickly become a common scenario. However I'd still like to use the occurences of log messages at the warn/error level as metric which indicate problems which need attention.

Proposal:

The log message shows the affect metric series, which certainly helps in tracking the issue. Therefore I suggest to log these events at the info level rather than error or warn.

weeco avatar Jun 27 '19 13:06 weeco

Hi @weeco , I'd like to work on this issue.

jayapriya90 avatar Oct 20 '19 13:10 jayapriya90

Hey @jayapriya90 Please go ahead!

So I don't have a good solution to the problem tbh, but the flow is like this.

The distributor calls Push API on the ingester, which returns an error and now in the distributor we log all the errors returned.

The error returned by the ingester: https://github.com/cortexproject/cortex/blob/master/pkg/distributor/distributor.go#L422

https://github.com/cortexproject/cortex/blob/master/pkg/distributor/distributor.go#L373-L390 and https://github.com/cortexproject/cortex/blob/master/pkg/distributor/http_server.go#L38-L48

Where we handle that error.

Though as the error is literally a string, I am not sure of a good way to handle it.

gouthamve avatar Oct 20 '19 13:10 gouthamve

We could put errors like this in WriteResponse instead of error, and extend the distributor to understand such responses and send back an error to the caller without logging it.

bboreham avatar Nov 12 '19 12:11 bboreham

There is some work in https://github.com/weaveworks/common/pull/195 to rate-limit "high volume errors", ~however as it stands it would not affect 400 errors.~ (not directly relevant as it's for http and these are gRPC errors).

bboreham avatar Aug 20 '20 15:08 bboreham

To update some of the earlier analysis since code may have moved:

The error logged by the distributor comes from here: https://github.com/cortexproject/cortex/blob/8587ea61fe17d71da529917c9e48b548f4dbe8ef/pkg/util/push/push.go#L47 and looks like this:

level=error ts=2020-09-24T04:06:45.73264618Z caller=push.go:47 org_id=redacted trace_id=64b5e086737cda17 msg="push error" err="rpc error: code = Code(400) desc = user=redacted: sample with repeated timestamp but different value; last value: 284390.4, incoming value: 52708147.3 for series {__name__=\"envoy_http_downstream_cx_length_ms_sum\",redacted}"

That error has been decoded by gRPC so we can have as much detail as we wish, although it seems to me that by this point in the code we know it's an error sent by ingester (as opposed to a network timeout, say, which we would fail without logging a few lines above).

The error logged by the ingester comes from here: https://github.com/cortexproject/cortex/blob/8587ea61fe17d71da529917c9e48b548f4dbe8ef/vendor/github.com/weaveworks/common/middleware/grpc_logging.go#L35-L39 and looks like this:

level=warn ts=2020-09-24T04:11:41.138368379Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=297.688µs err="rpc error: code = Code(400) desc = user=redacted: sample with repeated timestamp but different value; last value: 195435, incoming value: 956673 for series {__name__=\"envoy_http_downstream_rq_time_count\", redacted}" msg="gRPC\n"

Seems we could distinguish InvalidArgument here same as we do Canceled.

bboreham avatar Sep 24 '20 08:09 bboreham

I'm assuming this hasn't been fixed since this is still open, but just to clarify do you still need someone to take care of this? If so I would like to dip my toes in and see if I can help out

dmares01 avatar Jun 28 '21 21:06 dmares01

This issue is still open. What's your proposal to fix it?

pracucci avatar Jun 30 '21 15:06 pracucci

I do not have a proposal yet. My team just started using Cortex the past week and this was something we noticed. I figured if someone else wasn't already working on it that I would try to find a solution to it

dmares01 avatar Jun 30 '21 16:06 dmares01

@dmares01 by all means take a look. We don't think anyone is actively looking at this issue.

bboreham avatar Jul 09 '21 10:07 bboreham

@bboreham 👋🏼

I was taking a look at this issue and trying to understand what needs to be done. I think I understand what you said above but had a few foggy areas that I was hoping you could shed some light on:

The logged ingester warning from: https://github.com/cortexproject/cortex/blob/8587ea61fe17d71da529917c9e48b548f4dbe8ef/vendor/github.com/weaveworks/common/middleware/grpc_logging.go#L35-L39

can be handled by implementing IsInvalidArgument similar to grpcUtils.IsCanceled, and if IsInvalidArgument returns true, we log at the Info level. But, this would mean that all statuses of InvalidArgument would then be logged at the info level - which I don't think is desired?

The logged distributor error: https://github.com/cortexproject/cortex/blob/8587ea61fe17d71da529917c9e48b548f4dbe8ef/pkg/util/push/push.go#L47

similar question as the above for this as well, if we distinguish the error in some form, we would still be excluding other errors of that form.

I see that the error is constructed from https://github.com/cortexproject/cortex/blob/6408117fedb206dcbf12f09e1720ff0cc4cef104/pkg/ingester/errors.go#L34-L41

One possible solution could be if we can maybe extract the errorType somehow as https://github.com/cortexproject/cortex/blob/6408117fedb206dcbf12f09e1720ff0cc4cef104/pkg/ingester/series.go#L18

but this would mean exporting types, and I'm not sure if that's the best way to go about it. wdyt?

I might've completely misunderstood the approach mentioned by you as well, please lmk if that's the case :)

MadhavJivrajani avatar Jan 29 '22 12:01 MadhavJivrajani

@MadhavJivrajani sounds good so far. However Cortex depends on weaveworks/common; we can't export a string the other way round.

Maybe Cortex could pass in a filtering function (via an extension to weaveworks/common) ?

bboreham avatar Feb 07 '22 12:02 bboreham