bottlerocket
[Nvidia variant] Slow shutdown after upgrade
Context
We are running relatively short-lived (a few minutes to 2 hours) GPU jobs on our EKS cluster, scaling intensively.
- After upgrading EKS and the Bottlerocket AMIs from `bottlerocket-aws-k8s-1.28-x86_64-v1.20.2-536d69d0` to `bottlerocket-aws-k8s-1.30-x86_64-v1.26.2-360b7a38`, we started receiving alerts from Prometheus Alertmanager during node scale-downs, like:
Alert: One or more targets are unreachable.
Description: 33.33% of the kubelet/kube-prometheus-stack-kubelet targets in kube-system namespace are down.
This led us to the Karpenter logs:
{"level":"INFO","time":"2024-12-24T09:09:30.804Z","logger":"controller","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (0 pods) ip-xx-xx-xx-xx.us-east-2.compute.internal/g4dn.xlarge/on-demand","commit":"a2875e3","controller":"disruption","namespace":"","name":"","reconcileID":"aca176ce-e29f-4898-901d-18787e3647b6","command-id":"63b57d39-fdf1-4363-9007-6d0cd069633a","reason":"empty"}
...
{"level":"INFO","time":"2024-12-24T09:15:24.829Z","logger":"controller","message":"deleted node","commit":"a2875e3","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-xx-xx-xx-xx.us-east-2.compute.internal"},"namespace":"","name":" ip-xx-xx-xx-xx.us-east-2.compute.internal","reconcileID":"b9244d6a-767a-4e10-86d6-92d18247b573"}
The issue is that during the shutdown procedure our kubelet metrics scraping starts failing and the EC2 machine has already disconnected, but Karpenter does not proceed with removing the Node, as it waits for the EC2 instance to reach the Terminated state. Since that does not happen for the next 5-9 minutes (we have seen different numbers, never less than 5 minutes), the health check/metrics queries time out and we get hit with multiple alerts on every nvidia-variant scale-down.
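To quantify the delay, a small script like the following can poll the instance state after Karpenter issues the delete. This is just a sketch: the instance ID is a hypothetical placeholder, and a configured `aws` CLI (credentials, region) is assumed.

```shell
#!/bin/sh
# Sketch: measure how long an instance stays in "shutting-down".
# The instance ID passed to watch_termination is a placeholder.

instance_state() {
    # Print the current EC2 state name for an instance ID.
    aws ec2 describe-instances --instance-ids "$1" \
        --query 'Reservations[0].Instances[0].State.Name' --output text
}

elapsed_secs() {
    # Seconds elapsed since the epoch timestamp given in $1.
    echo $(( $(date +%s) - $1 ))
}

watch_termination() {
    # Poll every 10s until the instance reports "terminated",
    # then print how long the whole shutdown took.
    start=$(date +%s)
    while [ "$(instance_state "$1")" != "terminated" ]; do
        sleep 10
    done
    echo "terminated after $(elapsed_secs "$start")s"
}

# Usage (hypothetical instance ID):
# watch_termination i-0123456789abcdef0
```

On the nvidia variant this consistently reports 5+ minutes for us, versus roughly a minute on the non-GPU variants.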
Image I'm using: Bottlerocket OS 1.26.2 (aws-k8s-1.30-nvidia)
What I expected to happen: the EC2 machine should reach the Terminated state much faster, presumably in about the time it takes for the non-nvidia variant.
What actually happened: node shutdown takes 5+ minutes, with no errors reported by Karpenter. Here are the journald logs I managed to catch before being disconnected:
Dec 24 09:45:25 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:25.365283234Z" level=info msg="StopContainer for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" with timeout 2 (s)"
Dec 24 09:45:25 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:25.365658983Z" level=info msg="Stop container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" with signal terminated"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.376846500Z" level=info msg="Kill container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: cri-containerd-a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1.scope: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1-rootfs.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.552717188Z" level=info msg="shim disconnected" id=a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1 namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.552775266Z" level=warning msg="cleaning up after shim disconnected" id=a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1 namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.552789317Z" level=info msg="cleaning up dead shim" namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.570484159Z" level=info msg="StopContainer for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" returns successfully"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.571063746Z" level=info msg="StopPodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.571098765Z" level=info msg="Container to stop \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" must be in running or unknown state, current state \"CONTAINER_EXITED\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: run-containerd-io.containerd.grpc.v1.cri-sandboxes-c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611-shm.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: cri-containerd-c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611.scope: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.644048686Z" level=info msg="shim disconnected" id=c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611 namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.644125709Z" level=warning msg="cleaning up after shim disconnected" id=c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611 namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.644144951Z" level=info msg="cleaning up dead shim" namespace=k8s.io
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: run-containerd-io.containerd.runtime.v2.task-k8s.io-c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611-rootfs.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:45:27.717744754 +0000 UTC m=+0.004813718 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:45:27.717818589 +0000 UTC m=+0.004887546 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.723519664Z" level=info msg="TearDown network for sandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" successfully"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:27.723549925Z" level=info msg="StopPodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" returns successfully"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: run-netns-cni\x2dbc681b81\x2d8da7\x2d2106\x2dea0e\x2db24b7c8e9d86.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd-networkd[1197]: eni2b36f300ae6: Link DOWN
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd-networkd[1197]: eni2b36f300ae6: Lost carrier
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788451 1544 reconciler_common.go:161] "operationExecutor.UnmountVolume started for volume \"core-path\" (UniqueName: \"kubernetes.io/host-path/1e37ebad-a19a-4852-9e84-e806b1219859-core-path\") pod \"1e37ebad-a19a-4852-9e84-e806b1219859\" (UID: \"1e37ebad-a19a-4852-9e84-e806b1219859\") "
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788494 1544 reconciler_common.go:161] "operationExecutor.UnmountVolume started for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-aws-iam-token\") pod \"1e37ebad-a19a-4852-9e84-e806b1219859\" (UID: \"1e37ebad-a19a-4852-9e84-e806b1219859\") "
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788521 1544 reconciler_common.go:161] "operationExecutor.UnmountVolume started for volume \"kube-api-access-gcdwp\" (UniqueName: \"kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-kube-api-access-gcdwp\") pod \"1e37ebad-a19a-4852-9e84-e806b1219859\" (UID: \"1e37ebad-a19a-4852-9e84-e806b1219859\") "
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788526 1544 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/1e37ebad-a19a-4852-9e84-e806b1219859-core-path" (OuterVolumeSpecName: "core-path") pod "1e37ebad-a19a-4852-9e84-e806b1219859" (UID: "1e37ebad-a19a-4852-9e84-e806b1219859"). InnerVolumeSpecName "core-path". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788552 1544 reconciler_common.go:161] "operationExecutor.UnmountVolume started for volume \"tmp\" (UniqueName: \"kubernetes.io/empty-dir/1e37ebad-a19a-4852-9e84-e806b1219859-tmp\") pod \"1e37ebad-a19a-4852-9e84-e806b1219859\" (UID: \"1e37ebad-a19a-4852-9e84-e806b1219859\") "
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788620 1544 reconciler_common.go:289] "Volume detached for volume \"core-path\" (UniqueName: \"kubernetes.io/host-path/1e37ebad-a19a-4852-9e84-e806b1219859-core-path\") on node \"ip-xx-xx-xx-xx.us-east-2.compute.internal\" DevicePath \"\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.788812 1544 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/empty-dir/1e37ebad-a19a-4852-9e84-e806b1219859-tmp" (OuterVolumeSpecName: "tmp") pod "1e37ebad-a19a-4852-9e84-e806b1219859" (UID: "1e37ebad-a19a-4852-9e84-e806b1219859"). InnerVolumeSpecName "tmp". PluginName "kubernetes.io/empty-dir", VolumeGidValue ""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.790690 1544 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-kube-api-access-gcdwp" (OuterVolumeSpecName: "kube-api-access-gcdwp") pod "1e37ebad-a19a-4852-9e84-e806b1219859" (UID: "1e37ebad-a19a-4852-9e84-e806b1219859"). InnerVolumeSpecName "kube-api-access-gcdwp". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.790738 1544 operation_generator.go:887] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-aws-iam-token" (OuterVolumeSpecName: "aws-iam-token") pod "1e37ebad-a19a-4852-9e84-e806b1219859" (UID: "1e37ebad-a19a-4852-9e84-e806b1219859"). InnerVolumeSpecName "aws-iam-token". PluginName "kubernetes.io/projected", VolumeGidValue ""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: var-lib-kubelet-pods-1e37ebad\x2da19a\x2d4852\x2d9e84\x2de806b1219859-volumes-kubernetes.io\x7eprojected-kube\x2dapi\x2daccess\x2dgcdwp.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: var-lib-kubelet-pods-1e37ebad\x2da19a\x2d4852\x2d9e84\x2de806b1219859-volumes-kubernetes.io\x7eprojected-aws\x2diam\x2dtoken.mount: Deactivated successfully.
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.888794 1544 reconciler_common.go:289] "Volume detached for volume \"aws-iam-token\" (UniqueName: \"kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-aws-iam-token\") on node \"ip-xx-xx-xx-xx.us-east-2.compute.internal\" DevicePath \"\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.888833 1544 reconciler_common.go:289] "Volume detached for volume \"kube-api-access-gcdwp\" (UniqueName: \"kubernetes.io/projected/1e37ebad-a19a-4852-9e84-e806b1219859-kube-api-access-gcdwp\") on node \"ip-xx-xx-xx-xx.us-east-2.compute.internal\" DevicePath \"\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.888846 1544 reconciler_common.go:289] "Volume detached for volume \"tmp\" (UniqueName: \"kubernetes.io/empty-dir/1e37ebad-a19a-4852-9e84-e806b1219859-tmp\") on node \"ip-xx-xx-xx-xx.us-east-2.compute.internal\" DevicePath \"\""
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:27.921626 1544 kubelet_volumes.go:163] "Cleaned up orphaned pod volumes dir" podUID="1e37ebad-a19a-4852-9e84-e806b1219859" path="/var/lib/kubelet/pods/1e37ebad-a19a-4852-9e84-e806b1219859/volumes"
Dec 24 09:45:27 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: Removed slice libcontainer container kubepods-burstable-pod1e37ebad_a19a_4852_9e84_e806b1219859.slice.
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:28.329359 1544 scope.go:117] "RemoveContainer" containerID="a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal systemd[1]: kubepods-burstable-pod1e37ebad_a19a_4852_9e84_e806b1219859.slice: Failed to open /run/systemd/transient/kubepods-burstable-pod1e37ebad_a19a_4852_9e84_e806b1219859.slice: No such file or directory
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:28.330830934Z" level=info msg="RemoveContainer for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\""
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:28.333926033Z" level=info msg="RemoveContainer for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" returns successfully"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:28.334130 1544 scope.go:117] "RemoveContainer" containerID="a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:45:28.334322404Z" level=error msg="ContainerStatus for \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\" failed" error="rpc error: code = NotFound desc = an error occurred when try to find container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\": not found"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: E1224 09:45:28.334468 1544 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\": not found" containerID="a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1"
Dec 24 09:45:28 ip-xx-xx-xx-xx.us-east-2.compute.internal kubelet[1544]: I1224 09:45:28.334505 1544 pod_container_deletor.go:53] "DeleteContainer returned error" containerID={"Type":"containerd","ID":"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1"} err="failed to get container status \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\": rpc error: code = NotFound desc = an error occurred when try to find container \"a363fe13a05cfebe4be8677ace6469fbddf10878a690eb77e8ce72b0de6773b1\": not found"
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.880259964Z" level=info msg="StopPodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\""
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:46:15.898322951 +0000 UTC m=+0.004926890 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:46:15.898404069 +0000 UTC m=+0.005007982 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.902816735Z" level=info msg="TearDown network for sandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" successfully"
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.902845703Z" level=info msg="StopPodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" returns successfully"
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.903346648Z" level=info msg="RemovePodSandbox for \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\""
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.903378313Z" level=info msg="Forcibly stopping sandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\""
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:46:15.92104592 +0000 UTC m=+0.005342552 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: 2024-12-24 09:46:15.921147761 +0000 UTC m=+0.005444388 write error: can't make directories for new logfile: mkdir /host: read-only file system
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.925076206Z" level=info msg="TearDown network for sandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" successfully"
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.928024195Z" level=warning msg="Failed to get podSandbox status for container event for sandboxID \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\": an error occurred when try to find sandbox: not found. Sending the event with nil podSandboxStatus."
Dec 24 09:46:15 ip-xx-xx-xx-xx.us-east-2.compute.internal containerd[1487]: time="2024-12-24T09:46:15.928079760Z" level=info msg="RemovePodSandbox \"c00e66d8003921dff6c2f408a3b34c9df494bf694021b3455a49635d385bf611\" returns successfully"
How to reproduce the problem:
- Use the `bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.26.2-360b7a38` AMI (it also happens with the newest build for the EKS 1.30 variant)
- Scale down a deployment in order to trigger a Karpenter disruption with EmptyNode, or even drain the Node; any way of triggering termination works
- The EC2 instance should then sit in the Shutting-down state for a few minutes
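The drain path of the steps above can be sketched as follows. The node name is a hypothetical placeholder, and `kubectl drain` is only one of the termination triggers mentioned; a deployment scale-down that leaves the node empty works just as well.

```shell
#!/bin/sh
# Sketch of the reproduction: drain a GPU node, which lets
# Karpenter (or a manual terminate) proceed with termination.
# The node name is a hypothetical placeholder.

drain_node() {
    # Evict all evictable pods from the node.
    kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# After draining, watch the instance in the EC2 console; on the
# nvidia variant it should stay in "Shutting-down" for minutes:
# drain_node ip-xx-xx-xx-xx.us-east-2.compute.internal
```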
Thanks for this report! Do you think you could check the EC2 console logs for any suspicious messages?
I'll attempt to reproduce, but do you have any pointers about your environment that might be influencing the result? Any additional devices (volumes, network mounts) or network configurations that you think might be relevant could be useful.
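Pulling the console logs can be done with something along these lines. This is a sketch: the instance ID is a placeholder, and the `FAILED`/`error`/`timed out` patterns are just one guess at what counts as suspicious.

```shell
#!/bin/sh
# Sketch: fetch the EC2 console output and filter failure-looking
# lines. Instance ID is a hypothetical placeholder.

console_output() {
    # Fetch the most recent console output for an instance ID.
    aws ec2 get-console-output --instance-id "$1" \
        --latest --output text
}

suspicious_lines() {
    # Keep only lines that look like failures; reads stdin.
    grep -iE 'FAILED|error|timed? ?out'
}

# Usage (hypothetical instance ID):
# console_output i-0123456789abcdef0 | suspicious_lines
```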
Hi @cbgbt, thanks for the prompt response! Our setup seems ordinary, but here is our template just in case:
```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-object-detection
spec:
  instanceProfile: <instance_profile>
  subnetSelectorTerms:
    - tags:
        Name: "Service VPC Private*"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: secondary
  amiFamily: "Bottlerocket"
  amiSelectorTerms:
    - name: "bottlerocket-aws-k8s-1.30-x86_64-v1.26.2-360b7a38"
  userData: |
    [settings.boot.kernel-parameters]
    "crashkernel" = [
      "2G-:256M",
    ]
    [settings.kernel.sysctl]
    "net.core.rmem_max" = "16777216"
    "net.core.wmem_max" = "16777216"
    "kernel.core_pattern" = "/core-dumps/core-%p.%h.%t"
    [settings.ntp]
    "time-servers" = [
      "169.254.169.123",
      "time.google.com"
    ]
    [settings.kubernetes]
    "image-gc-low-threshold-percent" = "85"
    "image-gc-high-threshold-percent" = "90"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 4Gi
        volumeType: gp3
        encrypted: true
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 40Gi
        volumeType: gp3
        encrypted: true
  tags:
    ManagedBy: "Karpenter"
    karpenter.sh/discovery: secondary
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-object-detection
spec:
  template:
    metadata:
      labels:
        node.kubernetes/role: "gpu"
        node.kubernetes/type: "internal"
        dedicated: object-detection
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-object-detection
      requirements:
        - key: "karpenter.sh/capacity-type"
          operator: In
          values: ["on-demand"]
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values:
            - "g4dn.xlarge"
            - "g4dn.2xlarge"
            - "g4dn.4xlarge"
            - "g5.xlarge"
            - "g5.2xlarge"
            - "g5.4xlarge"
      taints:
        - key: dedicated
          value: object-detection
          effect: NoSchedule
  limits:
    nvidia.com/gpu: 50
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
```
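In case it helps with reproducing: the userData settings above can be confirmed on a running node from the Bottlerocket control or admin container with `apiclient`, roughly like this (a sketch; the setting prefixes mirror our userData, and availability of the control/admin container is assumed).

```shell
#!/bin/sh
# Sketch: verify the userData settings actually applied on a node.
# Run from the Bottlerocket control or admin container, where
# apiclient (Bottlerocket's on-host API client) is available.

check_settings() {
    apiclient get settings.kubernetes.image-gc-low-threshold-percent
    apiclient get settings.kernel.sysctl
    apiclient get settings.ntp.time-servers
}

# check_settings
```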
Also, I have checked the Console Log. Showing it from the first noted error; all previous lines were in status OK, e.g.:
...
[FAILED] Failed unmounting Kernel Modules (Read-Write).
[ OK ] Stopped Flush Journal to Persistent Storage.
[ OK ] Stopped target Swaps.
Unmounting /var/lib/kubelet...
[ OK ] Stopped Prepare Opt Directory (/opt).
Unmounting Opt Directory (/opt)...
[ OK ] Stopped Prepare Var Directory (/var).
[ OK ] Unmounted Opt Directory (/opt).
[ OK ] Unmounted /var/lib/kubelet.
[ OK ] Stopped target Preparation for Local File Systems.
Unmounting Var Directory (/var)...
[ OK ] Stopped Create Static Device Nodes in /dev.
[ OK ] Stopped Create System Users.
[ OK ] Stopped Remount Root and Kernel File Systems.
[ OK ] Reached target System Shutdown.
[ OK ] Unmounted Var Directory (/var).
Unmounting Local Directory (/local)...
[ OK ] Unmounted Local Directory (/local).
[ OK ] Reached target Unmount All Filesystems.
[ OK ] Reached target Late Shutdown Services.
[ OK ] Finished System Power Off.
[ OK ] Reached target System Power Off.
[ 194.586797] (direxec)[5943]: Failed to execute /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system-shutdown/mdadm.shutdown: No such file or directory
[ 194.649214] (sd-executor)[5940]: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system-shutdown/mdadm.shutdown failed with exit status 1.
[ 194.712196] systemd-shutdown[1]: Failed to finalize DM devices, ignoring.
[ 195.564778] reboot: Power down
During the investigation, I removed all pods that are used specifically by the nvidia machines and had approximately the same config as on the other nodeClasses/Pools (except that those are non-GPU instances). The non-GPU instances are stopped much faster, within a minute or so. I also tried reverting to the older AMI, but interestingly enough, instance shutdown still took 5-10 minutes.
So it is possible that the slow shutdown was there even before, but was hidden by how Karpenter used to handle nodes (remove the node from the EKS cluster and leave the EC2 shutdown running in the background, as opposed to the new v1.+ versions, which mark the node NotReady and wait for the EC2 machine to be completely terminated), and the new behavior is what triggered our alerts.
If you need any more info, I'll be happy to provide.