compliantkubernetes-apps
Investigation: Thanos failover
What should be investigated. As the troubleshooting session revealed, if a single Receive fails, Thanos cannot ingest new metrics, even if another replica is ready. We should investigate whether we can make these components more redundant, especially given how dependent we are on them to fully monitor an environment.
On a related note, we only run Query Frontend and Receive Distributor as single replicas; we should consider running these in HA to ensure quick failover. (It would also improve the situation for Thanos Rule, since it queries the frontend.)
What artifacts should this produce. Report back on what should be possible to implement and whether there are any considerations for doing it.
As the troubleshooting session revealed, if a single Receive fails, Thanos cannot ingest new metrics, even if another replica is ready.
Please refresh my memory here: what is the point of us running two replicas if this is the case?
-
Seems like the replication factor is currently broken when using separate routing and ingesting receivers, but it is fixed in newer versions of the chart.
-
The replication factor requires that the metrics can be sent to a quorum of the nodes matching on the hashring, so we need at least three replicas if we want any availability benefit.
I'll look at the impact this will have on resource usage, but I assume we would need larger nodes to fit that in.
The current benefit of two replicas with a replication factor of one is that it should spread the load across all ingesting receivers.
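To make the quorum point concrete, here is a rough sketch of the arithmetic. It assumes a simple majority write quorum of floor(replication_factor / 2) + 1; the function names are made up for illustration and are not part of Thanos or the chart.

```python
def write_quorum(replication_factor: int) -> int:
    """Number of successful writes needed, assuming a simple majority quorum."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """How many receivers holding a series can be down while writes still succeed."""
    return replication_factor - write_quorum(replication_factor)


# Compare the replication factors discussed in this thread.
for rf in (1, 2, 3):
    print(f"replication factor {rf}: quorum {write_quorum(rf)}, "
          f"tolerates {tolerated_failures(rf)} failed receiver(s)")
```

Under that assumption, replication factors of one and two both tolerate zero failed receivers, and three is the first value that tolerates one, which matches the need for at least three replicas.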
Surprisingly, the total memory usage of my receivers has been lower than in the previous setup, although the network usage of all instances has been a lot higher and the CPU usage somewhat higher.
I'd suggest waiting a bit longer until we have an updated chart where we can set replication on the distributor, since this seems to lower usage somewhat. (This had to be patched in the current one.)