
Adjust rate-limit usage on partial ingestion failure

Open bboreham opened this issue 4 years ago • 7 comments

After #3825 (since reverted), any failure to ingest samples causes the rate-limit reservation to be canceled. However, it is quite possible in Cortex that some samples were accepted and some rejected; we can only send one result back to the caller, so we send an error.

I think we would need a special gRPC error object that carries back the count of succeeded/failed samples to make this more accurate.

bboreham avatar Mar 01 '21 14:03 bboreham

You're right. What if we cancel only when the error is an httpgrpc 5xx error or a non-httpgrpc error?

pracucci avatar Mar 01 '21 15:03 pracucci

Right now I have some samples rejected due to being over the limit on series per metric, and the same user being over rate-limit, so #3825 (which we haven't rolled out yet) should improve matters.

Your suggestion would then make things worse, in this particular case.

bboreham avatar Mar 01 '21 15:03 bboreham

A simple and minor improvement would be to roll back only if all the samples fail to ingest.

This would still help with the original issue of ingesters being unavailable, but prevent a single bad sample from circumventing the rate limit (reverting to existing behavior).

stevesg avatar Mar 01 '21 18:03 stevesg

How would we know that?

bboreham avatar Mar 01 '21 18:03 bboreham

Good question, scratch that...

stevesg avatar Mar 01 '21 18:03 stevesg

> Your suggestion would then make things worse, in this particular case.

My suggestion was to cancel the rate-limiter reservation only when the distributor returns a 5xx, which means the client will retry the request regardless of whether some samples have been ingested. I understand it's not as accurate as what you propose (counting the exact number of samples ingested), but it may be a good compromise to solve the original issue, which was the case where 2+ ingesters are unhealthy.

pracucci avatar Mar 02 '21 08:03 pracucci

My point is that I don't want to report two errors when there is only one. I understand that your suggestion solves your issue, but this is my issue.

bboreham avatar Mar 02 '21 09:03 bboreham