piraeus-ha-controller

Bug: node is not reconciled if tainting operation failed

Open kvaps opened this issue 1 year ago • 1 comment

I1101 06:50:28.044916       1 agent.go:440] updating node taints
I1101 06:50:28.103291       1 agent.go:276] managing node taints failed: failed to update node taints: Operation cannot be fulfilled on nodes "srv1": the object has been modified; please apply your changes to the latest version and try again

This error is returned here:

https://github.com/piraeusdatastore/piraeus-ha-controller/blob/40d3ee8d115dc44f4b326a44e80fa4bd7acdf0ec/pkg/agent/reconcile_failover.go#L152-L154

kvaps, Nov 01 '24 07:11

I guess this could be improved. Ideally we would not need to retry at all, because a proper merge patch would avoid the conflict, but when I last tried that, it did not work for taints specifically.
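For illustration, here is a minimal sketch of the retry-on-conflict pattern mentioned above. It uses only the standard library and an in-memory stand-in for the API server, so `fakeStore`, `Node`, `ErrConflict`, and `addTaintWithRetry` are all hypothetical names and not part of piraeus-ha-controller. With a real cluster, the same loop is usually written with `retry.RetryOnConflict` from `k8s.io/client-go/util/retry`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the API server's 409 "object has been modified" error.
var ErrConflict = errors.New("conflict: object has been modified")

// Node is a simplified, hypothetical model of a Kubernetes node.
type Node struct {
	ResourceVersion int
	Taints          []string
}

// fakeStore mimics optimistic concurrency: an update succeeds only if the
// caller's ResourceVersion matches the stored one.
type fakeStore struct {
	node             Node
	concurrentWrites int // how many times another writer modifies the node first
}

// Get returns a copy of the stored node.
func (s *fakeStore) Get() Node {
	n := s.node
	n.Taints = append([]string(nil), s.node.Taints...)
	return n
}

// Update stores n, or returns ErrConflict if n is based on a stale version.
func (s *fakeStore) Update(n Node) error {
	if s.concurrentWrites > 0 {
		// Simulate another client modifying the node in the meantime.
		s.concurrentWrites--
		s.node.ResourceVersion++
	}
	if n.ResourceVersion != s.node.ResourceVersion {
		return ErrConflict
	}
	n.ResourceVersion++
	s.node = n
	return nil
}

// addTaintWithRetry re-reads the node and reapplies the change on every
// conflict instead of failing the whole reconcile on the first one.
func addTaintWithRetry(s *fakeStore, taint string, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		n := s.Get() // always start from the latest version
		present := false
		for _, t := range n.Taints {
			if t == taint {
				present = true
				break
			}
		}
		if !present {
			n.Taints = append(n.Taints, taint)
		}
		err := s.Update(n)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrConflict) {
			return err // only conflicts are worth retrying
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, ErrConflict)
}
```

The key point is that each attempt starts from a fresh read, so a concurrent modification costs one extra round trip rather than a failed reconcile.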

Even better would be to move away from tainting nodes directly. One idea is a webhook that adds a general anti-affinity rule to either all workloads or all PVs. The controller would then only need to label the node, which should work without having to update the node object directly.
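As a rough sketch of why a label-only approach avoids the conflict: a JSON merge patch (RFC 7386) that names a single label key touches only that key and carries no resource version, whereas a list such as taints is replaced wholesale by the same patch type. The helper below only builds the patch body. `nodeLabelPatch` and the label key used in the usage note are hypothetical, and sending the patch would still need a Kubernetes client (for example client-go's `Patch` call with `types.MergePatchType`).

```go
package main

import "encoding/json"

// nodeLabelPatch builds a JSON merge patch body that sets one node label.
// Passing a nil value produces a JSON null, which removes the label.
func nodeLabelPatch(key string, value *string) ([]byte, error) {
	patch := map[string]any{
		"metadata": map[string]any{
			"labels": map[string]*string{key: value},
		},
	}
	return json.Marshal(patch)
}
```

Calling `nodeLabelPatch("example.com/storage-unavailable", &v)` with `v := "true"` yields a body that sets the label, and passing `nil` instead yields one that removes it.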

WanzenBug, Nov 04 '24 07:11