prometheus mixin: remote-write related alert severity should take HA setup into account

Currently, the PrometheusRemoteStorageFailures and PrometheusRemoteWriteBehind alerts are critical. However, especially with remote-write setups, many users run HA pairs (or groups) of Prometheus servers, and the remote-write receiver has some way of deduplicating the incoming samples. In that case, a single Prometheus replica having trouble with remote-write should only be a warning. The alert should be critical only if all members of the HA group have trouble.
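
To make the proposed rule concrete, here is a minimal Python sketch of the severity decision for one HA group. It is not mixin code: the function name `ha_group_severity`, the replica names, and the per-replica boolean "has remote-write trouble" input are all assumptions made up for illustration.

```python
from typing import Dict


def ha_group_severity(replica_failing: Dict[str, bool]) -> str:
    """Return the proposed severity for one HA group.

    replica_failing maps a (hypothetical) replica name to True if that
    replica currently has remote-write trouble.
    """
    failing = [name for name, bad in replica_failing.items() if bad]
    if not failing:
        # No replica has trouble (or the group is empty): nothing to alert on.
        return "none"
    if len(failing) == len(replica_failing):
        # Every member of the HA group has trouble: data is at risk.
        return "critical"
    # At least one healthy replica is still delivering samples.
    return "warning"


if __name__ == "__main__":
    print(ha_group_severity({"prom-0": True, "prom-1": False}))
    print(ha_group_severity({"prom-0": True, "prom-1": True}))
```

Note that under this rule a group with a single replica goes straight to critical when that replica has trouble, which matches the current behaviour for non-HA setups.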

Apr 27 '20 15:04 beorn7

However, thinking about it, the way Cortex currently handles HA pairs will actually not switch the replica if one falls behind…

Nov 02 '20 14:11 beorn7

Hello from the bug scrub. Is there any progress on this issue, @beorn7? Otherwise we'll close it next time around.

Apr 23 '24 11:04 krajorama

I think alert severity is highly debatable, not only here but in several parts of our mixins. Some might say that if one replica is completely down, the HA setup is compromised and someone should be paged as a precaution. Others might say that data is still being ingested and it's safe to keep it like this for some time, no need to page.

What I wanted to highlight here is that alert severity is highly opinionated, and it is hard to find a one-size-fits-all solution 😬

Apr 23 '24 11:04 ArthurSens

I have noticed no complaints about the current state in the last 3.5 years, so let's close this for now. If anyone feels the need to revisit, they can follow up here or open a new issue and we'll take it from there.

Apr 30 '24 16:04 beorn7