
K8s DaemonSet Incompatible with Autoscaling

Open 11xor6 opened this issue 1 year ago • 5 comments

When the sysbox DaemonSet is deployed against an autoscaling node pool (GKE, but likely relevant on other providers too), pods fail to be scheduled on the node(s). The reason is that the RuntimeClass configuration adds the sysbox-runtime: running label to the pod's nodeSelector, which prevents the pod from matching the node pool and in turn prevents scale-up.
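For context, the RuntimeClass that sysbox-deploy-k8s installs looks roughly like the following (a sketch; the exact manifest ships with the installer). The scheduling.nodeSelector is what gets merged into each pod's nodeSelector at admission time:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sysbox-runc
handler: sysbox-runc
scheduling:
  nodeSelector:
    # This label is added to the node only after the DaemonSet has
    # installed sysbox, so an autoscaler simulating a fresh node from
    # the pool's template never sees it and concludes scale-up won't help.
    sysbox-runtime: running
```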

Switching the RuntimeClass to use a static label for node selection seems workable, given the taint added to the node during installation; however, I have very rarely (and seemingly at random) seen pods die.
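As a sketch of that workaround, the RuntimeClass can select on a label that exists on the node from the start. The label shown here (sysbox-install: "yes", the one the sysbox install DaemonSet itself targets) is an assumption; adapt it to your node pool:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sysbox-runc
handler: sysbox-runc
scheduling:
  nodeSelector:
    # Static label present on the node (and in the pool's node template)
    # before sysbox is installed, so the cluster autoscaler can match
    # the node pool and trigger scale-up.
    sysbox-install: "yes"
```

The trade-off is that node selection no longer proves the runtime is installed; only the install-time taint keeps pods off the node until sysbox is ready.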

11xor6 avatar Dec 17 '24 10:12 11xor6

Just a small update here: I am now consistently seeing errors whenever a node scales up. Generally, all pods created during a scale-up fail when they are scheduled to the new node. The failure seems to be that the pod gets scheduled in the window between the DaemonSet script removing the taint and the RuntimeClass actually being ready and supported; the pod then fails because the RuntimeClass isn't supported on the node.

Constructs like Deployments and StatefulSets will generally hide this error by automatically restarting the pod. I only found it because my current application creates a pod directly. This could be mitigated by not removing the taint until the RuntimeClass is actually available on the node.
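A minimal sketch of that mitigation: a generic retry helper the deploy script could use to gate taint removal on the runtime actually being registered. The probe command in the usage comment is an assumption about where CRI-O keeps its runtime config; adapt it to your setup.

```shell
# wait_for N CMD...: re-run CMD until it succeeds, up to N attempts,
# sleeping 1s between attempts; returns non-zero on timeout.
wait_for() {
  tries=$1; shift
  i=0
  until "$@"; do
    i=$((i + 1))
    [ "$i" -ge "$tries" ] && return 1
    sleep 1
  done
}

# Hypothetical usage inside sysbox-deploy-k8s, gating the taint removal:
#   wait_for 60 grep -q 'sysbox-runc' /etc/crio/crio.conf.d/*.conf \
#     && rm_taint_from_node "${k8s_taints}"
```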

11xor6 avatar Jan 14 '25 05:01 11xor6

Hi @11xor6, thanks for reporting the issue.

The failure seems to be that the pod gets scheduled to the node between the time the DaemonSet script removes the taint and the time that the actual RuntimeClass is ready and supported; and the pod fails because the RuntimeClass isn't supported on the Node.

Not sure how that can be the case though, because no sysbox pods will be scheduled on the node until it is labeled with sysbox-runtime=running, and that labeling already occurs before the taint is removed (see the sysbox-deploy-k8s main script here):

1250 │     add_label_to_node "crio-runtime=running"
1251 │     add_label_to_node "sysbox-runtime=running"
1252 │     rm_taint_from_node "${k8s_taints}"

So something else must be going on (?)

ctalledo avatar Jan 25 '25 05:01 ctalledo

BTW, in case you want to play around with the sequencing of steps in sysbox-deploy-k8s, you can follow these steps:

  1. git clone the sysbox repo and edit the sysbox-deploy-k8s main script as needed.

  2. From within the sysbox-pkgr/k8s directory type make to build a new sysbox-deploy-k8s container image.

  3. Push that image to your registry

  4. Modify the sysbox-install.yaml to point to your new image

  5. kubectl apply -f sysbox-install.yaml to apply it on your k8s cluster.
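The steps above can be sketched as a shell transcript. The image and registry names are placeholders, and the script path is an assumption; check the repo's Makefile for the actual image tag it produces:

```shell
git clone https://github.com/nestybox/sysbox.git
cd sysbox/sysbox-pkgr/k8s
# Edit the sysbox-deploy-k8s main script as needed, then:
make                                       # builds the sysbox-deploy-k8s image
docker tag <built-image> <your-registry>/sysbox-deploy-k8s:dev
docker push <your-registry>/sysbox-deploy-k8s:dev
# Point sysbox-install.yaml at <your-registry>/sysbox-deploy-k8s:dev, then:
kubectl apply -f sysbox-install.yaml
```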

ctalledo avatar Jan 25 '25 05:01 ctalledo

I haven't had time to do a full rundown, but I suspect that when the node/kubelet is restarted, the labels and taints are reset to those of the NodePool. I have some secondary evidence for this: the sysbox-runtime label is reset when nodes are restarted for maintenance, which leads me to believe the same happens to other labels and taints.

If that is the case, then the pod gets scheduled (and dies) when the node comes back up.
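One way to check that hypothesis after a node restart (assuming kubectl access; the node name is a placeholder):

```shell
NODE=<node-name>
# Labels: is sysbox-runtime=running still present?
kubectl get node "$NODE" --show-labels | tr ',' '\n' | grep sysbox
# Taints: were the install-time taints restored by the pool template?
kubectl get node "$NODE" -o jsonpath='{.spec.taints}'
```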

11xor6 avatar Jan 28 '25 22:01 11xor6

We have the same issue. For the workloads we want to run with sysbox, waiting for a DaemonSet is not going to work. Our nodes are triggered by incoming jobs, and Karpenter will not create new nodes until those jobs are scheduled on actual nodes. This is why, in our case, having nodes (an AMI, in our case) with Sysbox pre-installed makes a lot of sense.

While there are alternatives, such as building our own AMI using Packer, the fork of oci-r appears to be lagging behind and I am not entirely sure whether the changes have been merged or not.

scav avatar Jul 01 '25 14:07 scav