
Gap in Read and Write when HA Prometheus replica changes

Open andreimiclea99 opened this issue 1 year ago • 4 comments

Describe the bug

Every time the Prometheus replica changes in the HA tracker, it leads to a gap of around 30 seconds in Mimir writes and reads (screenshot attached).

Another issue I see when the Prometheus replica changes is duplicated values for some metrics, for instance count(count(container_memory_usage_bytes{namespace="$namespace",container="$container",pod=~"$pod"}) by (instance)), with the output shown in the attached screenshot.

The HA tracker replica changes whenever the elected Prometheus pod gets terminated, whether because of node termination, OOM, or simply pod deletion.

No data is lost, but we have some sensitive alerts in production that trigger when something like this happens.
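
For reference, this is my understanding of the Prometheus-side setup that HA deduplication relies on (the label values below are placeholders, not my exact config):

    # Each Prometheus replica sends the same cluster label but a unique __replica__
    # label; Mimir's HA tracker elects one replica per cluster and drops samples
    # from the other replica.
    global:
      external_labels:
        cluster: prom-ha           # placeholder cluster name
        __replica__: prometheus-0  # unique per replica, e.g. the pod name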

Environment

Mimir and the two Prometheus instances are running inside Kubernetes. The Mimir version is 2.11; I noticed the same behaviour on 2.9 and 2.10. For deployment I used the mimir-distributed Helm chart, version 5.0.0.

Additional Context

The Prometheus scrape interval is 30 seconds, and when this happens I don't see any error logs or resource spikes in Mimir components. Not sure if it's relevant, but I am not using Memcached for caching.
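
For completeness, the HA tracker is enabled on the Mimir side; with the mimir-distributed chart that ends up as something like the following under mimir.structuredConfig (a sketch of the general shape, not my exact values):

    mimir:
      structuredConfig:
        limits:
          accept_ha_samples: true    # enable dedup based on the cluster/__replica__ labels
        distributor:
          ha_tracker:
            enable_ha_tracker: true
            kvstore:
              store: etcd            # illustrative; any KV store supported by the HA tracker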

andreimiclea99 avatar Feb 28 '24 13:02 andreimiclea99

The dropped 30s of data and the duplicated series sound like expected behaviour. Have you tried tuning these three settings?

  # (advanced) Update the timestamp in the KV store for a given cluster/replica
  # only after this amount of time has passed since the current stored
  # timestamp.
  # CLI flag: -distributor.ha-tracker.update-timeout
  [ha_tracker_update_timeout: <duration> | default = 15s]

  # (advanced) Maximum jitter applied to the update timeout, in order to spread
  # the HA heartbeats over time.
  # CLI flag: -distributor.ha-tracker.update-timeout-jitter-max
  [ha_tracker_update_timeout_jitter_max: <duration> | default = 5s]

  # (advanced) If we don't receive any samples from the accepted replica for a
  # cluster in this amount of time we will failover to the next replica we
  # receive a sample from. This value must be greater than the update timeout
  # CLI flag: -distributor.ha-tracker.failover-timeout
  [ha_tracker_failover_timeout: <duration> | default = 30s]
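
As a rough back-of-the-envelope (my reading of the mechanics, so treat the numbers as approximate): the elected replica's timestamp in the KV store is refreshed at most every update-timeout (plus jitter), and failover only happens once that stored timestamp is older than failover-timeout, so the write gap after the elected Prometheus dies should land somewhere in the window sketched below. Also note that with a 30s scrape interval, losing even a single scrape shows up as a roughly 30s hole in graphs, so the visible gap is effectively floored at one scrape interval no matter how tight the timeouts are.

    # Approximate write-gap window after the elected replica stops sending:
    #   lower bound ≈ failover_timeout - update_timeout - jitter
    #   upper bound ≈ failover_timeout
    # with the defaults (15s / 5s / 30s): roughly 10s .. 30s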

dimitarvdimitrov avatar Feb 28 '24 14:02 dimitarvdimitrov

@dimitarvdimitrov

I forgot to mention that I changed those values with:

      ha_tracker:
        ha_tracker_update_timeout: 5s
        ha_tracker_update_timeout_jitter_max: 5s
        ha_tracker_failover_timeout: 12s

The above screenshots are with those values; they indeed reduced the gap to around 30 seconds. Before changing them the gap was 45 seconds, so there was an improvement.

andreimiclea99 avatar Feb 28 '24 14:02 andreimiclea99

I still think the lost scrape is the documented and expected behaviour. Do you see anything in the docs that doesn't match what you observed?

There's also the Grafana Agent's clustering mode, which doesn't rely on Mimir's HA tracker for failover, so failures might be a bit less noticeable: https://grafana.com/docs/agent/latest/flow/concepts/clustering/
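
If you go that route, clustering is enabled on the agents themselves; with the grafana-agent Helm chart that looks roughly like the values below (key names from memory, so double-check them against the chart's defaults):

    agent:
      mode: flow         # clustering is a flow-mode feature
      clustering:
        enabled: true    # agents discover each other and shard scrape targets between them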

dimitarvdimitrov avatar Feb 29 '24 10:02 dimitarvdimitrov

@dimitarvdimitrov To be honest I am not sure if this is the expected behaviour for my case, but it sounds like it.

Looks a bit better with these values:

distributor:
  extraArgs:
    distributor.ha-tracker.update-timeout-jitter-max: 2s
    distributor.ha-tracker.update-timeout: 2s
    distributor.ha-tracker.failover-timeout: 5s

Will try to do more tests.

What is a bit annoying is that for a couple of minutes it shows duplicated values for some metrics when the Prometheus replica change happens.

andreimiclea99 avatar Mar 05 '24 08:03 andreimiclea99