operator The operator sometimes fails to detect VMAlert resources

The operator sometimes fails to detect VMAlert resources in my cluster (e2e CI environment).

In this case, the operator logged nothing about detecting VMAlert resources, and the vmalert-* Deployments had not been created. I have not seen the same problem with other resource types (VMAgent, VMAlertmanager, VMSingle, VMCluster).

After I restarted the operator (with kubectl rollout restart), it detected the VMAlert resources and created their Deployments. Leaving it for several hours would probably have the same effect once a periodic forced reconciliation runs, but I have not tested that.

What else should I investigate?

Sep 27 '22 07:09 umezawatakeshi

Hello, this looks like a bug to me. We're going to add more e2e tests soon and will try to catch it.

Sep 27 '22 08:09 f41gh7

@f41gh7 Hi, I'm his colleague.

In our test environment, we apply many resources (VMAlert * 2, VMPodScrape * 36, VMServiceScrape * 25) at the same time with Argo CD, and some reconcile requests have been throttled by the rate limiter (https://github.com/VictoriaMetrics/operator/commit/dfb6a14e1193089ba5ab112e0acf4e459aba68b4).

Could you improve the rate limiter? With the current implementation, some reconciliation requests are ignored. When RequeueAfter in ctrl.Result is set, controller-runtime requeues the request after the specified time.

return ctrl.Result{RequeueAfter: N * time.Second}, nil

ref: https://github.com/cybozu-go/cattage/blob/4018694c91cf3794a91243a6845bead992a065c3/controllers/application_controller.go#L173-L175
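To make the suggestion concrete, here is a minimal sketch of the pattern using only the standard library. Everything in it is an assumption for illustration: Result is a local stand-in for ctrl.Result (only the RequeueAfter field is modelled), and windowLimiter, reconcile and throttledCount are hypothetical names, not the operator's actual limiter or reconciler.

```go
package main

import (
	"sync"
	"time"
)

// Result is a local stand-in for controller-runtime's ctrl.Result so the
// sketch compiles without external modules. Only RequeueAfter is modelled.
type Result struct {
	RequeueAfter time.Duration
}

// windowLimiter is a hypothetical fixed-window limiter: at most budget calls
// to Allow succeed within each window. It is not the operator's real limiter.
type windowLimiter struct {
	mu     sync.Mutex
	budget int
	window time.Duration
	start  time.Time
	used   int
}

// Allow reports whether one more reconcile may run at time now.
func (l *windowLimiter) Allow(now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if now.Sub(l.start) >= l.window {
		// A new window begins, so the budget is reset.
		l.start, l.used = now, 0
	}
	if l.used >= l.budget {
		return false
	}
	l.used++
	return true
}

// reconcile runs work when the limiter allows it. When throttled, it returns
// a non-zero RequeueAfter instead of an empty Result, so the caller knows the
// request should be retried later rather than forgotten.
func reconcile(l *windowLimiter, now time.Time, retry time.Duration, work func() error) (Result, error) {
	if !l.Allow(now) {
		return Result{RequeueAfter: retry}, nil
	}
	return Result{}, work()
}

// throttledCount issues n reconciles at the same instant against a limiter
// with the given budget per 2-second window and counts how many of them were
// asked to requeue.
func throttledCount(n, budget int) int {
	l := &windowLimiter{budget: budget, window: 2 * time.Second}
	now := time.Unix(0, 0)
	throttled := 0
	for i := 0; i < n; i++ {
		res, _ := reconcile(l, now, 2*time.Second, func() error { return nil })
		if res.RequeueAfter > 0 {
			throttled++
		}
	}
	return throttled
}
```

With 10 simultaneous requests and a budget of 5, throttledCount(10, 5) reports 5 requests that carry a RequeueAfter and would therefore come back later, instead of being silently dropped.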

Sep 30 '22 09:09 masa213f

Thanks a lot for investigating it! I'll make a fix for this case.

Sep 30 '22 14:09 f41gh7

Sorry for the delay. I've looked into it, and the best option for me is to change the env var VM_FORCERESYNCINTERVAL for testing purposes. It should reconcile vmalert and vmagent after the given interval. The main reason I'm not happy with a simple ctrl.Result{RequeueAfter: N * time.Second} for each throttled object is that it may significantly reduce operator performance and increase resource usage. On the other hand, currently batch changes (more than 5 objects in 2 seconds) will be throttled and resynced in ~60 seconds.

Oct 24 '22 14:10 f41gh7

"On the other hand, currently batch changes (more than 5 objects in 2 seconds) will be throttled and resynced in ~60 seconds."

IMHO, this is not true, since you return ctrl.Result{}, nil. The object will not be requeued until a new event arrives (such as an update to the vmalert) or the controller-manager's SyncPeriod elapses. https://github.com/kubernetes-sigs/controller-runtime/blob/4c9c9564e4652bbdec14a602d6196d8622500b51/pkg/manager/manager.go#L135

As for the resync, only objects that have actually been reconciled will be resynced. https://github.com/VictoriaMetrics/operator/blob/6540abd39d57f35ee5bd089427333ab92f4b2c40/controllers/vmalert_controller.go#L126-L128

Say, for example, there are 10 vmalert CRs when the operator starts. controller-manager will enqueue all of the objects and reconcile them. Consider the following case, where each - represents a CR.

0s    1s    2s
|-----|-----|

The first 5 CRs get reconciled and the budget runs out, so the next five CRs are throttled, which means they will never be reconciled until a new event arrives. I think this is not acceptable, as it results in inconsistency.
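The scenario above can be reproduced with a small discrete-time simulation. This is only a sketch under stated assumptions: simulate is a hypothetical helper, the limiter is modelled as a fixed window that resets every windowSec seconds, and the work queue is reduced to a slice of "ready at" times. It does not use controller-runtime and is not the operator's code.

```go
package main

// simulate enqueues n objects at t=0 and processes them second by second
// until horizonSec. At most budget objects are reconciled per window of
// windowSec seconds (windowSec must be positive). A throttled object is
// either dropped, which models returning an empty result, or made ready
// again after requeueAfterSec seconds, which models RequeueAfter.
// It returns how many objects were reconciled.
func simulate(n, budget, windowSec, requeueAfterSec, horizonSec int, requeue bool) int {
	queue := make([]int, n) // each entry is the time the object becomes ready
	reconciled, used := 0, 0
	for t := 0; t <= horizonSec; t++ {
		if t%windowSec == 0 {
			used = 0 // a new window starts, so the budget is reset
		}
		var pending []int
		for _, readyAt := range queue {
			switch {
			case readyAt > t:
				// Not due yet, keep waiting.
				pending = append(pending, readyAt)
			case used < budget:
				used++
				reconciled++
			case requeue:
				// Throttled, but scheduled for another attempt.
				pending = append(pending, t+requeueAfterSec)
			}
			// Otherwise the object is throttled and dropped: nothing
			// brings it back without a new event.
		}
		queue = pending
	}
	return reconciled
}
```

With 10 objects, a budget of 5 per 2 seconds and a 10-second horizon, simulate(10, 5, 2, 2, 10, false) returns 5, while simulate(10, 5, 2, 2, 10, true) returns 10, because the throttled half is picked up in the next window.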

Oct 29 '22 08:10 just1900

It looks like the throttle limit on vmalert was already removed by https://github.com/VictoriaMetrics/operator/commit/63ca52bf140b033ecbc3c40f9efc8579b936ea29, so I'm closing this as completed. Feel free to re-open if you have further questions.

Jul 08 '23 06:07 Haleygo