mimir
Remote-write: allow all errors to be returned
Is your feature request related to a problem? Please describe.
Suppose a remote-write Push request comes in, containing 100 series, and 5 of them cause an error (over limit, label too long, etc.). Currently the distributor will just return one of those errors. At least one user has expressed a preference to get all errors, so they can report back to the source and ensure they are all tracked down.
Describe the solution you'd like
An option (per-tenant? per-call?) to receive a set of errors instead of just one. Probably this is an extension to the Prometheus remote-write proto format.
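To make the difference concrete, here is a minimal Go sketch of "first error only" versus "all errors" over one batch. Everything in it is hypothetical and for illustration only: the `series` type, the `maxLabelValueLen` limit, and the `validateSeries` and `validateAll` functions are not Mimir's actual types or validation code.

```go
package main

import "fmt"

// series is a hypothetical stand-in for one time series in a push request.
type series struct {
	name   string
	labels map[string]string
}

// maxLabelValueLen is an assumed limit, chosen only for this example.
const maxLabelValueLen = 16

// validateSeries is a hypothetical per-series check. It reports the first
// over-long label value it finds; real validation covers many more cases.
func validateSeries(s series) error {
	for k, v := range s.labels {
		if len(v) > maxLabelValueLen {
			return fmt.Errorf("series %q: label %q value too long (%d > %d)",
				s.name, k, len(v), maxLabelValueLen)
		}
	}
	return nil
}

// validateAll checks the batch in order. With all=false it stops at the
// first failing series, so the caller sees one error per call. With
// all=true it keeps going and returns one error per failing series.
func validateAll(batch []series, all bool) []error {
	var errs []error
	for _, s := range batch {
		if err := validateSeries(s); err != nil {
			errs = append(errs, err)
			if !all {
				break
			}
		}
	}
	return errs
}
```

With the 100-series example above, `validateAll(batch, false)` would return a slice of length 1 and `validateAll(batch, true)` a slice of length 5.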
Describe alternatives you've considered
Change nothing: if the workload is being sliced at random, there is a good probability that over time a different 100 series will be sent in each call, so you can see all the errors across a set of calls. (But if the slicing is not random, this doesn't help.)
I think it would make sense to do it per-call, but instead of adding the option to the remote-write proto, maybe we could just do it as a header? That would still allow it to be configured in Prometheus, but reflect the fact that this is more of a Mimir-specific thing.
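A per-call opt-in via a header could look roughly like the Go sketch below. The header name `X-Mimir-Return-All-Errors` and the functions `optedIn` and `wantsAllErrors` are hypothetical names invented for this example, not existing Mimir identifiers.

```go
package main

import (
	"net/http"
	"strconv"
)

// returnAllErrorsHeader is a hypothetical header name used only in this sketch.
const returnAllErrorsHeader = "X-Mimir-Return-All-Errors"

// optedIn interprets the raw header value. A missing or unparsable value
// is treated as "not opted in", which keeps the single-error behaviour.
func optedIn(headerValue string) bool {
	v, err := strconv.ParseBool(headerValue)
	return err == nil && v
}

// wantsAllErrors reports whether this push request asked for all errors.
func wantsAllErrors(r *http.Request) bool {
	return optedIn(r.Header.Get(returnAllErrorsHeader))
}
```

Defaulting to the existing behaviour when the header is absent means clients that never set it would see no change.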
I think I was referring to the response when I said "Probably this is an extension to the Prometheus remote-write proto format."
I am concerned that reporting all errors could be very computationally expensive, e.g. if a call contains 500 series and 500 of them error.
After reflection I think something like #2420 is a more suitable solution for the requirement to report issues back to different sources. It would be a quantitative report rather than precise errors, but the potential cost is more contained.
@bboreham
> I am concerned that reporting all errors could be very computationally expensive, e.g. if a call contains 500 series and 500 of them error.
Aren't there a finite number of error codes? With that finite number being ~10 (e.g. 'out-of-order', 'sample-too-old', etc etc)?
What if we returned all the error codes in a given push and a subset (no more than 2 or 3) of the datapoints that were rejected for this reason?
Does this reduce the computational cost?
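To illustrate the idea, here is a rough Go sketch of grouping rejections by reason while keeping only a few examples for each. The `rejection` and `reasonSummary` types and the `summarize` function are hypothetical, written for this comment rather than taken from Mimir. Under these assumptions the size of the result is bounded by the number of reasons times `maxExamples`, however many series fail. Every series still has to be validated, so this caps the size of the report and not the validation work itself.

```go
package main

// rejection is a hypothetical record of one rejected datapoint.
type rejection struct {
	reason string // e.g. "out-of-order", "sample-too-old"
	series string // identifies the rejected series
}

// reasonSummary holds the total count for one reason plus a few examples.
type reasonSummary struct {
	count    int
	examples []string
}

// summarize groups rejections by reason. It counts every rejection but
// keeps at most maxExamples series per reason (the first ones seen).
func summarize(rejections []rejection, maxExamples int) map[string]*reasonSummary {
	out := make(map[string]*reasonSummary)
	for _, r := range rejections {
		s, ok := out[r.reason]
		if !ok {
			s = &reasonSummary{}
			out[r.reason] = s
		}
		s.count++
		if len(s.examples) < maxExamples {
			s.examples = append(s.examples, r.series)
		}
	}
	return out
}
```

For the worst case mentioned earlier, 500 series all rejected as out-of-order with `maxExamples` set to 3, the result would be a single entry with a count of 500 and three example series.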