cortex Retrying AlertManager UnaryPath Requests on next replica if one fail

What this PR does:

AlertManager unary APIs are failing due a single alert manager fail as distributor picks a random one to forward the request.

This change only pick the replicas alert managers in case of a error.

Which issue(s) this PR fixes: Fixes #

Checklist

[X] Tests updated
[ ] Documentation added
[X] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Aug 13 '21 17:08 alanprot

Thanks for working on this. I can add some context:

The problem with creating silences is that the operation is not idempotent in [upstream] alertmanager. Each API call creates a new silence with a new UUID, which are then replicated to the peers. Adding retries mean you could end up with multiple silences being created from a single API call. e.g. The API call to the first replica is started but doesn't return (e.g. maybe a timeout is hit). That silence is created and the replicated around. If we retry, then the calls to the second/third replica will continue and create duplicates. It's arguable what behavior is better, ideally, we would add an API to upstream alertmanager allowing idempotent silence creation, but whilst this is certainly achievable, it's non-trivial.

The same problem exists with deletion, it is not idempotent (it returns 4xx if the ID is not found), so you could delete from one replica, try the second, and it has already been deleted due to replication, so you get a 4xx error. This is not such a big problem as for creation.

With regards to /status and /receivers, I don't see why they can't be fixed up, though I don't think we need to retry but can just use a quorum read and pick the first result that comes back (Use the merger.Noop{}). In an ideal world the distributor would be transparent, just performing quorum operations with different result merging.

Aug 19 '21 07:08 stevesg

Hi @stevesg

Yeah.. i was indeed thinking about the Put Silence... I will see if there is something we can do there.

About the delete i think is fine right? If i get a 4xx on the first and retry in the second, it should be fine. I can make it not retry on 4xx if needed, but i think should be fine here.

For /status and /receivers is still a good thing to do retry them right? Those APIs are returning 5xx on every deployment.

Aug 19 '21 17:08 alanprot

My over-arching point/question was whether we should aim to use quorum operations for everything, so the distributor is logically as transparent as possible (modulo result merging).

About the delete i think is fine right? If i get a 4xx on the first and retry in the second, it should be fine. I can make it not retry on 4xx if needed, but i think should be fine here.

Agreed we can probably make delete work with upstream as-is, preserving the alertmanager API semantic of returning 404 is the annoying bit. Wondering if we should just do quorum write here too? This would make deletes be synchronously replicated. The resulting merging could be along the lines of: if all 404, then 404 else if all 200 or 404 then 200 else error.

For /status and /receivers is still a good thing to do retry them right? Those APIs are returning 5xx on every deployment.

Absolutely, I was suggesting that you could use the existing quorum mechanism to save adding complexity.

Aug 23 '21 08:08 stevesg

For adding a silence, could we change the API so that the caller makes up a UUID for the request and passes that in, rather than the alertmanager making one and returning it?

Aug 23 '21 17:08 bboreham

Hi @bboreham and @stevesg .. Thanks for the inputs.

if all 404, then 404 else if all 200 or 404 then 200 else error.

Will it be tricky to do this using DoBatch? As far as i know DoBatch will return the last error as soon as it reach the maxFailures.

@bboreham The upstream alertmanager API? I like the idea of making the caller send the UUID, but this seems to be a breaking change, no? I was thinking that maybe we could pass a "UUID Generator" on the Options and use it here - So we could consistently set the same UUID on all the replicas. What do you think? The problem with that is we would always generate the same UUID for the same "body content" I guess.

Besides that, can we change this PR to address the read/delete path and create a new one/issue to address the Put Silence?

Aug 23 '21 20:08 alanprot

Yes I did mean change the API, but we could have it fall back to the existing behaviour if no UUID supplied, for backwards-compatibility.

Besides that, can we change this PR to address the read/delete path and create a new one/issue to address the Put Silence?

Fine.

Aug 26 '21 19:08 bboreham

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

Nov 25 '21 09:11 stale[bot]

Still relevant..

Nov 25 '21 19:11 alanprot

Closing as this is being worked on https://github.com/cortexproject/cortex/pull/4840

Aug 31 '22 17:08 alanprot

cortex cortex copied to clipboard

Retrying AlertManager UnaryPath Requests on next replica if one fail

cortex
cortex copied to clipboard