Ability to reconcile offline nodes
Hi,
We are currently trying to write a manager for Citus in Kubernetes. Adding worker nodes works great using the citus_add_node UDF.
The problem we are facing is that when a worker disappears (via scale-down or outright deletion), the drain and remove node functions stop working, because Citus needs to resolve the worker through DNS.
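To make the case concrete, here is a rough sketch of the reconcile step we have in mind. It is not our actual manager code: the reconcile function, the Plan type and the host names are made up for illustration, and it assumes the registered workers were read from pg_dist_node and that workers listen on port 5432. It only builds the statements the manager would run, and it sets aside the workers that no longer resolve, which are exactly the ones we cannot drain or remove today.

```python
import socket
from typing import Callable, Iterable, List, NamedTuple


class Plan(NamedTuple):
    add: List[str]               # statements for workers not yet registered
    drain_and_remove: List[str]  # statements for workers that still resolve
    unreachable: List[str]       # registered workers that no longer resolve


def resolves(host: str) -> bool:
    """Return True if the host name currently resolves through DNS."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def reconcile(
    registered: Iterable[str],
    desired: Iterable[str],
    port: int = 5432,  # assumption: default PostgreSQL port on every worker
    can_resolve: Callable[[str], bool] = resolves,
) -> Plan:
    """Compare registered workers (e.g. nodename from pg_dist_node) with the
    workers Kubernetes says should exist, and plan the Citus calls."""
    registered, desired = set(registered), set(desired)
    plan = Plan([], [], [])
    # Illustration only: real code should pass host and port as query
    # parameters instead of formatting them into the SQL text.
    for host in sorted(desired - registered):
        plan.add.append(f"SELECT citus_add_node('{host}', {port});")
    for host in sorted(registered - desired):
        if can_resolve(host):
            plan.drain_and_remove.append(f"SELECT citus_drain_node('{host}', {port});")
            plan.drain_and_remove.append(f"SELECT citus_remove_node('{host}', {port});")
        else:
            # The case this issue is about: the worker is gone from DNS,
            # so the drain and remove calls cannot reach it.
            plan.unreachable.append(host)
    return plan
```

For example, reconcile(["w-0", "w-1"], ["w-0"]) would put w-1 under drain_and_remove while its DNS record still exists, and under unreachable once the pod and its endpoint are gone. What we are missing is a supported way to deal with that second list.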
We have tried using preStop hooks, but according to the Kubernetes documentation these are run when the pod is terminated. That is too late for Citus, because at that point the pod's networking endpoint has already been removed and the worker can no longer be resolved.
I'd love to chat through this, as I think it would be useful to establish how to recover from worker nodes becoming unreachable, and how to run Citus in a Kubernetes environment where workers can move and disappear.
Stack
Citus 11 (Docker) on k3d
k3d version v5.4.1
k3s version v1.22.7-k3s1 (default)