cortex
cortex copied to clipboard
HA Tracker KV Store usage not rate-limited
Describe the bug When the HA tracker calls checkKVStore and calls into the Consul client's CAS method (code), it is not using a rate limiter or backoff. Contrast that to calls to WatchPrefix, which has a rate limiter.
Digging into the CAS method in the client, it defaults to no backoff (code).
sleepBeforeRetry := time.Duration(0)
if c.cfg.CasRetryDelay > 0 {
sleepBeforeRetry = time.Duration(rand.Int63n(c.cfg.CasRetryDelay.Nanoseconds()))
}
There is a config for backoff, but it's for tests only:
// Used in tests only.
MaxCasRetries int `yaml:"-"`
CasRetryDelay time.Duration `yaml:"-"`
The result, which happened to us recently during a brief Consul impairment, is that the Cortex Distributor spams Consul thousands of times per minute during the downtime (only for the ha-tracker key), such that even when Consul recovers, it has blocked the Distributor with 429 errors, which makes the Distributor continue spamming and reinforcing the ban. We only recovered after the distributors took a total outage, which allowed Consul to clear its ban on the Distributors.
Here are some example logs for one instance; note how close together they are in time. There was a single minute in which a single distributor logged 28,248 of these events.
2022-03-30T08:20:07.360463896Z stderr F level=error ts=2022-03-30T08:20:07.360394895Z caller=client.go:154 msg="error getting key" key=ha-tracker/fake/monitoring/kube-prometheus-stack-prometheus err="Unexpected response code: 429"
2022-03-30T08:20:07.360379895Z stderr F level=error ts=2022-03-30T08:20:07.360331594Z caller=client.go:154 msg="error getting key" key=ha-tracker/fake/monitoring/kube-prometheus-stack-prometheus err="Unexpected response code: 429"
2022-03-30T08:20:07.360317594Z stderr F level=error ts=2022-03-30T08:20:07.360264593Z caller=client.go:154 msg="error getting key" key=ha-tracker/fake/monitoring/kube-prometheus-stack-prometheus err="Unexpected response code: 429"
2022-03-30T08:20:07.360258693Z stderr F level=error ts=2022-03-30T08:20:07.360194893Z caller=client.go:154 msg="error getting key" key=ha-tracker/fake/monitoring/kube-prometheus-stack-prometheus err="Unexpected response code: 429"
To Reproduce Steps to reproduce the behavior:
- Start Cortex 1.11
- Impair the Consul KV store
- watch the HA tracker retry behavior
Expected behavior When there is an impairment on the KV store, the distributor should regularly interact with it to check for recovery, but should implement either rate limiting or backoff to prevent excessive spamming. This also applies to situations in which the KV store is returning 429 errors--Cortex should retry with backoff in order to allow server-side rate limiting of the KV store to clear.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: ArgoCD/Helm
Storage Engine
- [x ] Blocks
- [ ] Chunks
Additional Context
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.