Highly available vmalert setup skips some recording evaluations at restart
I'm currently setting up a HA vmalert cluster. The relevant part of its definition is:
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlert
metadata:
  name: vmalert-cluster
spec:
  replicaCount: 2
I have a group of recording rules with a 30s period. When I run kubectl rollout restart deployment ... I observe missing points in the recording rule results.
AFAICT, the problem is that the vmalert pods are restarted too quickly, while each vmalert starts rule evaluation only after a (not so) random delay: https://github.com/VictoriaMetrics/VictoriaMetrics/blob/cluster/app/vmalert/group.go#L226-L231
I propose that the liveness check should pass only after all rule evaluations have started.
It seems that vmalert needs to support a /ready endpoint for a startup probe. @hagen1778 WDYT?
Possible workaround: change the readiness probe config in the VMAlert resource.
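To make the /ready idea concrete, here is a minimal Go sketch of a readiness check that reports ready only once every rule group has begun evaluating. It is not vmalert's code: the `readiness` type, `newReadiness`, `groupStarted`, and `status` are hypothetical names invented for illustration, and the wiring to an actual /ready route is only hinted at in a comment.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// readiness (hypothetical) counts rule groups that have not yet
// begun their first evaluation.
type readiness struct {
	pending atomic.Int64
}

// newReadiness creates a tracker for the given number of groups.
func newReadiness(groups int) *readiness {
	r := &readiness{}
	r.pending.Store(int64(groups))
	return r
}

// groupStarted is meant to be called once by each group right before
// its first evaluation.
func (r *readiness) groupStarted() { r.pending.Add(-1) }

// status returns 503 while any group is still waiting to start,
// and 200 once all of them have started.
func (r *readiness) status() int {
	if r.pending.Load() > 0 {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// ServeHTTP lets the tracker be registered as a handler,
// e.g. http.Handle("/ready", r) (the route name is an assumption).
func (r *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(r.status())
}

func main() {
	r := newReadiness(2)
	fmt.Println(r.status()) // 503
	r.groupStarted()
	r.groupStarted()
	fmt.Println(r.status()) // 200
}
```

A probe pointed at such a handler would hold back the rollout until the new replica has started all groups, although it says nothing about whether those groups are actually producing series.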
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlert
metadata:
  name: vmalert-cluster
spec:
  replicaCount: 2
  readinessProbe: {}
I think this problem was more common before https://github.com/VictoriaMetrics/VictoriaMetrics/commit/96db7ac52c5323eacc4005493cdf0aa37b1462bc, because a new vmalert could spend too much time restoring data instead of creating new series. I also think it's hard to determine whether a vmalert is healthy: even if all groups have started, that doesn't mean every group is configured correctly and has started creating new series. So the original problem (some missing points) would still be there. @hagen1778 WDYT?
Yes, vmalert indeed delays the start of group evaluation to avoid the thundering herd problem. Restarting vmalert could result in missed evaluations. However, vmalert is not expected to restart frequently, as its configuration can be hot-reloaded. And even if one rule evaluation is skipped, that shouldn't cause many issues, as VM's query engine should gracefully handle a single missing point. I think the same problem applies to vmagent restarts, as they could miss some scrapes, couldn't they?
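For readers unfamiliar with the delayed start, the following Go sketch shows one way such a spread can work: each group gets a fixed offset inside its interval, derived from its ID, and waits until the next moment the wall clock hits that offset. This is an illustration under assumptions, not a copy of the linked group.go code; `startDelay` and its parameters are hypothetical, and the offset is computed with a simple modulo.

```go
package main

import (
	"fmt"
	"time"
)

// startDelay (hypothetical) returns how long a group should wait before
// its first evaluation. The group ID is mapped to a fixed offset in
// [0, interval), and the delay is the time remaining until the clock
// next reaches that offset. nowUnixNano is the current Unix time in
// nanoseconds and is assumed to be non-negative.
func startDelay(groupID uint64, interval time.Duration, nowUnixNano int64) time.Duration {
	period := uint64(interval)
	target := groupID % period               // this group's slot within the interval
	current := uint64(nowUnixNano) % period  // where the clock is within the interval
	if target < current {
		// The slot has already passed in this interval; wait for the next one.
		target += period
	}
	return time.Duration(target - current)
}

func main() {
	// The clock is 10s into a 30s interval (100s mod 30s).
	now := int64(100 * time.Second)
	fmt.Println(startDelay(5_000_000_000, 30*time.Second, now))  // 25s
	fmt.Println(startDelay(20_000_000_000, 30*time.Second, now)) // 10s
}
```

With a scheme like this, a freshly restarted replica can wait up to almost a full interval before a given group first evaluates, which is how a quick rollout of both replicas can leave a gap of one point.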
Closing the issue as stale. Feel free to reopen if needed.