Highly available vmalert setup skips some recording evaluations at restart

Open g7r opened this issue 4 years ago • 1 comments
I'm currently setting up a HA vmalert cluster. The relevant part of its definition is:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlert
metadata:
  name: vmalert-cluster

spec:
  replicaCount: 2

I have a group of recording rules with a 30s evaluation interval. When I run kubectl rollout restart deployment ... I observe missing points in the recording rule results.

AFAICT, the problem is that the vmalert instances are restarted too quickly, and that each vmalert starts rule evaluation only after a (not so) random delay: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/app/vmalert/group.go#L226-L231

I propose that the liveness check should become true only after all rule evaluations have been started.

g7r avatar Oct 13 '21 16:10 g7r

It seems that vmalert needs support for a /ready endpoint for use with a startup probe. @hagen1778 WDYT?

A possible workaround is to change the readiness probe config in the VMAlert resource:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlert
metadata:
 name: vmalert-cluster
spec:
 replicaCount: 2
 readinessProbe: {}

f41gh7 avatar Oct 15 '21 18:10 f41gh7

I think this problem was more common before https://github.com/VictoriaMetrics/VictoriaMetrics/commit/96db7ac52c5323eacc4005493cdf0aa37b1462bc, because a new vmalert could spend too much time restoring data instead of creating new data. I also think it's hard to determine whether a vmalert is healthy: even if all groups have started, that doesn't mean every group is configured correctly and has started creating new series. So the original problem (some missing points) is still there. @hagen1778 WDYT?

Haleygo avatar Jul 08 '23 07:07 Haleygo

Yes, vmalert does delay the start of group evaluation to avoid the thundering herd problem, so restarting vmalert can result in missed evaluations. However, vmalert is not expected to restart frequently, as its configuration can be hot-reloaded. And even if one rule evaluation is skipped, that shouldn't cause many issues, as VM's query engine should gracefully handle a single missing point. I think the same problem applies to vmagent restarts, as they could miss some scrapes, couldn't they?

hagen1778 avatar Jul 08 '23 19:07 hagen1778

Closing the issue as stale. Feel free to reopen if needed.

hagen1778 avatar Jul 14 '23 13:07 hagen1778