piraeus-ha-controller
Bug: node is not reconciled if tainting operation failed
I1101 06:50:28.044916 1 agent.go:440] updating node taints
I1101 06:50:28.103291 1 agent.go:276] managing node taints failed: failed to update node taints: Operation cannot be fulfilled on nodes "srv1": the object has been modified; please apply your changes to the latest version and try again
This error is thrown here:
https://github.com/piraeusdatastore/piraeus-ha-controller/blob/40d3ee8d115dc44f4b326a44e80fa4bd7acdf0ec/pkg/agent/reconcile_failover.go#L152-L154
I think this could be improved. Ideally, we would not need to retry at all, because a proper merge patch would avoid the conflict, but when I last tried that, it did not work for taints specifically.
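To make the retry option concrete, here is a minimal sketch of re-reading the node and reapplying the taint when the update hits a conflict. It is not the controller's code: `fakeAPI`, `node`, `addTaintWithRetry`, and the taint key are hypothetical stand-ins so the example runs with the standard library only. Against a real cluster, the same loop is what `retry.RetryOnConflict` from `k8s.io/client-go/util/retry` provides.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's "object has been modified" error.
var errConflict = errors.New("conflict: the object has been modified")

// node is a pared-down, hypothetical stand-in for a Kubernetes Node.
type node struct {
	resourceVersion int
	taints          []string
}

// fakeAPI mimics optimistic concurrency: an update is accepted only if it
// carries the resource version that is currently stored.
type fakeAPI struct {
	stored         node
	externalWrites int // how many racing writes by other clients to simulate
}

// get returns a copy of the stored node, like a fresh GET from the API server.
func (a *fakeAPI) get() node {
	n := a.stored
	n.taints = append([]string(nil), a.stored.taints...)
	return n
}

// update stores n unless its resource version is stale.
func (a *fakeAPI) update(n node) error {
	// Simulate another writer modifying the node between our get and update.
	if a.externalWrites > 0 {
		a.externalWrites--
		a.stored.resourceVersion++
	}
	if n.resourceVersion != a.stored.resourceVersion {
		return errConflict
	}
	n.resourceVersion++
	a.stored = n
	return nil
}

// addTaintWithRetry fetches the latest node, adds the taint, and retries
// from the fetch when the update is rejected with a conflict.
func addTaintWithRetry(api *fakeAPI, taint string, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		n := api.get()
		for _, t := range n.taints {
			if t == taint {
				return nil // already tainted, nothing to do
			}
		}
		n.taints = append(n.taints, taint)

		err := api.update(n)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err // only conflicts are worth retrying
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, errConflict)
}

func main() {
	// Two racing writers cause two conflicts before the update goes through.
	api := &fakeAPI{externalWrites: 2}
	err := addTaintWithRetry(api, "example.com/hypothetical-taint", 5)
	fmt.Println("error:", err, "taints:", api.stored.taints)
}
```

The important detail is that each attempt starts from a fresh read, so changes made by other writers are preserved instead of being overwritten. A bounded attempt count keeps the agent from spinning if the node is updated continuously.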
Even better would be to move away from tainting nodes directly. One idea would be a webhook that adds a general anti-affinity to either all workloads or all PVs, so that we would only need to label the node, which should work without having to update the node directly.