
Remote-write: allow all errors to be returned

Open bboreham opened this issue 2 years ago • 4 comments

Is your feature request related to a problem? Please describe.

Suppose a remote-write Push request comes in, containing 100 series, and 5 of them cause an error (over limit, label too long, etc.). Currently the distributor will just return one of those errors. At least one user has expressed a preference to get all errors, so they can report back to the source and ensure they are all tracked down.

Describe the solution you'd like

An option (per-tenant? per-call?) to receive a set of errors instead of just one. Probably this is an extension to the Prometheus remote-write proto format.

Describe alternatives you've considered

Change nothing: if the workload is being sliced at random, there is a good probability that over time a different 100 series will be sent in each call, so you can see all the errors across a set of calls. (But if it's not random, this doesn't help.)

bboreham avatar Jul 11 '22 15:07 bboreham

I think it would make sense to do it per-call, but instead of adding the option to the remote-write proto, maybe we could just do it as a header? That would still allow it to be configured in Prometheus but reflect the fact that this is more of a Mimir-specific thing.

LeviHarrison avatar Aug 02 '22 20:08 LeviHarrison

I think I was referring to the response when I said "Probably this is an extension to the Prometheus remote-write proto format.".

bboreham avatar Aug 08 '22 14:08 bboreham

I am concerned that reporting all errors could be very computationally expensive, e.g. if a call contains 500 series and 500 of them error.

After reflection I think something like #2420 is a more suitable solution for the requirement to report issues back to different sources. It would be a quantitative report rather than precise errors, but the potential cost is more contained.

bboreham avatar Aug 08 '22 14:08 bboreham

@bboreham

> I am concerned that reporting all errors could be very computationally expensive, e.g. if a call contains 500 series and 500 of them error.

Isn't there a finite number of error codes? With that finite number being ~10 (e.g. 'out-of-order', 'sample-too-old', etc.)?

What if we returned all the error codes seen in a given push, and for each code a small subset (no more than 2 or 3) of the datapoints that were rejected for that reason?

Does this reduce the computational cost?
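To make the idea concrete, here is a rough sketch of grouping rejections by error code while keeping only a capped number of examples per code. All type and function names (`Rejection`, `ReasonSummary`, `Summarize`) are hypothetical and not taken from Mimir. It assumes the distributor already knows the reason for each rejected series.

```go
package main

import "sort"

// Rejection is a hypothetical record of one rejected series.
type Rejection struct {
	Reason string // error code, e.g. "out-of-order" or "sample-too-old"
	Series string // series identity, e.g. its label set rendered as a string
}

// ReasonSummary holds the total count for one error code plus a few examples.
type ReasonSummary struct {
	Reason   string
	Count    int
	Examples []string
}

// Summarize groups rejections by reason and keeps at most maxExamples
// example series per reason. The result is sorted by reason so the
// output is deterministic.
func Summarize(rejections []Rejection, maxExamples int) []ReasonSummary {
	byReason := make(map[string]*ReasonSummary)
	for _, r := range rejections {
		s, ok := byReason[r.Reason]
		if !ok {
			s = &ReasonSummary{Reason: r.Reason}
			byReason[r.Reason] = s
		}
		s.Count++
		// Examples beyond the cap are counted but not stored.
		if len(s.Examples) < maxExamples {
			s.Examples = append(s.Examples, r.Series)
		}
	}
	out := make([]ReasonSummary, 0, len(byReason))
	for _, s := range byReason {
		out = append(out, *s)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Reason < out[j].Reason })
	return out
}
```

In this sketch the stored output is bounded by (number of distinct codes) × `maxExamples`, regardless of how many series fail. The pass over the rejected series is still linear in their number.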

09jvilla avatar Aug 09 '22 17:08 09jvilla