thanos icon indicating copy to clipboard operation
thanos copied to clipboard

receive: Incorrect error status code with multiple AlreadyExists

Open phillebaba opened this issue 2 years ago • 13 comments

Thanos, Prometheus and Golang version used:

Thanos: quay.io/thanos/thanos:v0.26.0 Prometheus: quay.io/prometheus/prometheus:2.36.0

Object Storage Provider: Azure

What happened:

I am running a multi-tenant router-receiver setup with multiple Prometheus instances remote writing to it, and a replication factor of 2. At times an error occurs where remote writing fails, and Prometheus will have to resend time series. However when this occurs some time series may already be present in the receiver store and return a AlreadyExistserror. This does not seem to be an issue when a single receiver already has the time series, but a bigger issue when multiple receivers returns the same error. The Prometheus instance will receive a 500 error and try again, forcing it to end up in an infinite loop where it will never be able to recover from.

What you expected to happen:

I expect the correct error to be returned for this condition so that the Prometheus instance will stop sending the same time series which already exist in the receiver store.

How to reproduce it (as minimally and precisely as possible):

This error will only occur when receivers are being torn down and setup again one by one. While this is being done some replications may fail causing the Prometheus instance to resend the same data. It is hard to pin point more conditions to get this error to happen, as sometimes everything works out without any issues. When the error occurs every single Prometheus instance will fail to write as they will continue to attempt sending the same metrics over and over again.

Full logs to relevant components:

level=error ts=2022-06-07T14:16:31.548131665Z caller=handler.go:373 component=receive component=receive-handler tenant=*** err="2 errors: replicate write request for endpoint thanos-ingestor-receiver-0.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-ingestor-receiver-0.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-ingestor-receiver-0.monitor.svc.cluster.local:10901: conflict; replicate write request for endpoint thanos-ingestor-receiver-5.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: quorum not reached: forwarding request to endpoint thanos-ingestor-receiver-0.thanos-ingestor-receiver.monitor.svc.cluster.local:10901: rpc error: code = AlreadyExists desc = store locally for endpoint thanos-ingestor-receiver-0.monitor.svc.cluster.local:10901: conflict" msg="internal server error"

Anything else we need to know:

I have done some analysis of what is occurring and my guess is that the determineWriteErrorCause function has a bug in it where it does not detect the conflict error and return it. https://github.com/thanos-io/thanos/blob/17c576472d80972bfd3705e1e0a08e6f8da8e04b/pkg/receive/handler.go#L730

Looking at the handler code it is apparent that if a wrapped error is returned a 500 status code will be given instead of a 409 which is the correct one for conflict errors. https://github.com/thanos-io/thanos/blob/17c576472d80972bfd3705e1e0a08e6f8da8e04b/pkg/receive/handler.go#L361-L375

Reading through the replication code it seems like the original AlreadyExists error is being wrapped two or three times into other errors. When this is done the function to determine if the error is a conflict will not work and instead it will return the whole error causing a 500 status. https://github.com/thanos-io/thanos/blob/22aefd6cc666d0b509ed3748d6f719b60952cf6b/pkg/receive/handler.go#L484 https://github.com/thanos-io/thanos/blob/22aefd6cc666d0b509ed3748d6f719b60952cf6b/pkg/receive/handler.go#L582 https://github.com/thanos-io/thanos/blob/22aefd6cc666d0b509ed3748d6f719b60952cf6b/pkg/receive/handler.go#L665

We need to write a unit test specifically replicating the AlreadyExists error to verify that determineWriteErrorCause will return a conflict error.

phillebaba avatar Jun 07 '22 16:06 phillebaba

I think I have found the origin of this issue. As the comment in determineWriteErrorCause states it will not go deeper than the first multi error to determine the cause. The issue is that there are places where a returned multi error is flattened and places where it is not. If the multi error is not flattened the cause will never be determined. Resulting in a 500 status code.

This error is flattened with a comment explaining why it is. https://github.com/thanos-io/thanos/blob/22aefd6cc666d0b509ed3748d6f719b60952cf6b/pkg/receive/handler.go#L510-L513

In the case where replicate is called in fan out it is not flattened, meaning that there is a risk that a multi error is placed inside of a multi error. https://github.com/thanos-io/thanos/blob/22aefd6cc666d0b509ed3748d6f719b60952cf6b/pkg/receive/handler.go#L475-L490

phillebaba avatar Jun 07 '22 17:06 phillebaba

I really think that there is something weird going on when the replication factor is greater than one. Decreasing the replication factor temporarily gets all Prometheus agents out of their bad loop, and then it can be increased again. When decreasing the replication factor it starts returning 409 errors instead which gets them to backoff and decrease their shards.

The issue with testing this is that it is difficult to reproduce the error properly.

phillebaba avatar Jun 08 '22 14:06 phillebaba

I think we might be able to catch this with a good enough test. For clarification, do you only see this when running separate receivers and ingesters? You could run the same multi-tenant setup without having a split.

fpetkovski avatar Jun 09 '22 05:06 fpetkovski

This can also be a bug in how we verify quorum during replication: https://github.com/thanos-io/thanos/blob/main/pkg/receive/handler.go#L438

The actual formula should be

(ReplicationFactor + 1) / 2

So with replication factor of 2, it should be sufficient for just one write to succeed.

That is also what is described in the design doc: https://thanos.io/tip/proposals-done/201812-thanos-remote-receive.md/

fpetkovski avatar Jun 09 '22 09:06 fpetkovski

It might take some time to setup a new environment without the ingestor split. I will try to get some time to write a unit test to reproduce this.

phillebaba avatar Jun 14 '22 11:06 phillebaba

Please ask if you need further information, we're experiencing this issue in a routing receive setup and obviously use multiple receive for scaling reasons, which makes applying the workaround quite hard.

Tontaube avatar Jun 28 '22 06:06 Tontaube

@Tontaube are you seeing the same issue with a replication factor of 1? I think that is the interesting detail, which will help confirm the idea that the issue is related to the replication factor logic.

phillebaba avatar Jun 29 '22 14:06 phillebaba

Can confirm, we see the same issue with a replication factor of 1

Kampe avatar Jul 03 '22 01:07 Kampe

I cannot tell, as our environment recovered due to solved other issues. We saw the problem always then, when data came in more than 2 hours too late. Reducing the receive instances was not tried, on one hand due to the fact it being a gitops only environment (no manual adaptions in prod) and the assumption, that one receiver would not be capable to handle the load, anyway.

Tontaube avatar Jul 12 '22 12:07 Tontaube

I am not 100% sure but maybe I am strugling same issue on my HA setup with 3x receivers , 2x replica : thanos, version 0.27.0 (branch: HEAD, revision: f68bb366d18557a5347e6b38013f363ad5be05a4) build user: root@708554d4fa46 build date: 20220705-14:57:42 go version: go1.17.11 platform: linux/amd64 https://github.com/thanos-io/thanos/discussions/5513

enimath avatar Aug 12 '22 11:08 enimath

Hey @phillebaba,

I revisited this and I think you're totally right, we are unable to properly determine error cause if there is a multi-error within a multi-error. This can happen, especially with replication factor 2, which you mention is your case. Looking at this portion of the method:

https://github.com/thanos-io/thanos/blob/9237dd9f33fc4b316b2cbd0f1f57a87c36386ade/pkg/receive/handler.go#L1126-L1132

we see that the method returns a "cause" error (line 1129) if threshold is reached or it returns the original error (line 1132).

What I'm suspecting is that setting threshold to write quorum can cause this unwanted multi-error nesting. I considered two different scenarios:

  • with replication factor 2, the write quorum (threshold) is 2. This means that during a fanout, if 1 request succeeds and 1 request fails (what seems to be your case), we will never reach threshold. This means that even if error from the one failing request is a conflict error, we will actually return the multi-error (causing the "multi-error in multi-error" problem and resulting in 500).

  • with replication factor 3 (which for example uses our team), the write quorum (threshold) is still 2. Things can go a few ways: a) If 1 request fails, we still reached quorum, so there will be no error returned to Prometheus client b) if 2 requests fail with conflict error, we will reach the threshold and return conflict error = there will be no nested multi-error c) if 2 requests fail with different causes errors, we will not reach threshold, returning multi-error = we will return 500 to Prometheus client and it will retry

This might explain why this issue is not very prevalent with (I believe) more commonly used quorums like 3 or 5, where we usually reach the threshold. However, with quorum 2, if 1 request fails, this automatically means we always nest multi-error = we always return 500 to client = infinite retries.

I hope this is making sense. If these assumptions are correct, I see two ways of fixing this: a) We remove the threshold and simply return either a the most frequent expected cause error or if none are found, we return the original error b) As @fpetkovski was suggesting, our write quorum formula might not be as was originally intended - if we switched to (n + 1) / 2, this would make difference for even number replication factors, since with replication factor 2, we would only need 1 successful write and threshold during fanout error would be 1 as well, meaning we would avoid the nested multi-error (this would be my preference)

matej-g avatar Sep 16 '22 15:09 matej-g

I can absolutely believe that there are some subtle issues in receive error handling. We've brought up discussions around this in the past and I fear that we haven't been completely rigorous in our assessment of receive errors :/// xref: https://github.com/thanos-io/thanos/pull/3833#discussion_r582692050, https://github.com/thanos-io/thanos/pull/2899#pullrequestreview-449132049

just to add some historical context, the behavior of error handling was changed when we changed multi-error libraries and how we flattened/nested them. this behavior entirely changed how errors are counted and thus how quorum is validated. This challenge has been further highlighted in the past by speed optimizations to the error handling path that may have simplified error handling for speed reasons but maybe did not account for all of the edge-cases. That's not to say that the goals of correctness and performance are incompatible.

@matej-g I think your explanation sounds very very feasible however I'm worried that the we'll continue to play whackamole with receiver error issues if we aren't completely rigorous and exhaustive in how we analyze all of the potential error scenarios. I guess overall I'm mostly worried that we will implement heuristics for simplicity instead of being completely exhaustive.

a) We remove the threshold and simply return either a the most frequent expected cause error or if none are found, we return the original error

I'm somewhat confused by this; could you clarify it? The most frequent error cause isn't necessarily the failure we care about bubbling up to the top and thus not necessarily what should dictate the final error code, right?

I think we would be very well served to add explicit test cases for the scenarios you highlighted in handler_test.go so we can document the correct behavior in tests.

squat avatar Sep 16 '22 16:09 squat

Anyone working on this? I am also facing same issue.

SushilSanjayBhile avatar Mar 26 '24 16:03 SushilSanjayBhile