terraform-oci-oke
drain_nodes unable to drain Pods with local storage
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Terraform Version and Provider Version
Using OKE v4.2.6 and the following providers:
% terraform -v
Terraform v1.2.8
on darwin_amd64
+ provider registry.terraform.io/hashicorp/cloudinit v2.2.0
+ provider registry.terraform.io/hashicorp/local v2.2.3
+ provider registry.terraform.io/hashicorp/null v3.1.1
+ provider registry.terraform.io/hashicorp/time v0.8.0
+ provider registry.terraform.io/hashicorp/tls v4.0.2
+ provider registry.terraform.io/oracle/oci v4.87.0
Affected Resource(s)
module.oke.module.extensions.null_resource.drain_nodes
Terraform Configuration Files
operator_state = "RUNNING"
upgrade_nodepool = true
node_pools_to_drain = ["main_node_pool_1_23"]
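For context, these are module inputs; an assumed workflow (the tfvars file name is illustrative, not from the module docs) would be:

```shell
# Assumed workflow: put the variables above in a tfvars file and apply;
# the drain then runs from the operator host via remote-exec (per the logs below).
terraform apply -var-file="upgrade.tfvars"
```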
Debug Output
Panic Output
There was no panic; the Terraform apply finished successfully:
...
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): 10.47.16.22 drained
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0]: Creation complete after 2m2s [id=6046740775330308523]
Apply complete! Resources: 2 added, 0 changed, 1 destroyed.
Expected Behavior
The drain should finish successfully even when a node hosts Pods using local storage; adding the --delete-emptydir-data flag to the drain scripts would accomplish this.
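I haven't verified the module's exact drain invocation, but conceptually the change is a single flag. A minimal sketch, assuming the drain script loops over node IPs and calls kubectl drain (NODE and the other flags shown are placeholders for illustration, not the module's actual arguments):

```shell
# Sketch only: evict Pods even when they use emptyDir (local) storage.
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --timeout=900s \
  --delete-emptydir-data   # the proposed addition
```

With that flag, every node should drain cleanly, as in the successful case below: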
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): node/10.47.16.75 cordoned
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): WARNING: ignoring DaemonSet-managed Pods: contour-system-private/envoy-private-zb4q2, kube-system/csi-oci-node-j8dgn, kube-system/kube-flannel-ds-s8m7b, kube-system/kube-proxy-lwxrk, kube-system/proxymux-client-tnmnh
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod kube-system/coredns-845c966fb4-4684b
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod contour-system-private/contour-certgen-v1.19.0-private-nq7g9
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod cert-manager/cert-manager-webhook-8b876c7db-jh4pg
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod default/nginx-5b75b4c66b-8lk4n
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod default/nginx-5b75b4c66b-pfv5v
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/contour-certgen-v1.19.0-private-nq7g9 evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/cert-manager-webhook-8b876c7db-jh4pg evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/nginx-5b75b4c66b-8lk4n evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/nginx-5b75b4c66b-pfv5v evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/coredns-845c966fb4-4684b evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): node/10.47.16.75 drained
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): 10.47.16.75 drained
Actual Behavior
The drain did not complete on nodes hosting Pods using local storage:
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): node/10.47.30.170 cordoned
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): error: unable to drain node "10.47.30.170" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/sealed-secrets-controller-66848bcc4f-cw8b2, continuing command...
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): There are pending nodes to be drained:
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): 10.47.30.170
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/sealed-secrets-controller-66848bcc4f-cw8b2
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): 10.47.30.170 drained
Steps to Reproduce
- Deploy a Pod that uses local storage (for example, kubernetes-dashboard); a minimal example is sketched after this list
- Set upgrade_nodepool = true and add a node pool to the node_pools_to_drain list
- Run terraform apply
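For the first step, here is a hypothetical minimal Pod that triggers the failure; any workload with an emptyDir volume behaves the same way, since kubectl drain treats emptyDir as local storage:

```shell
# Hypothetical reproduction: the emptyDir volume makes kubectl drain refuse
# eviction unless --delete-emptydir-data is passed.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: local-storage-demo
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir: {}
EOF
```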
Important Factoids
References
Hi @aibarbetta,
Thanks for bringing this to our attention. Looks like you've already figured out the solution :) Can you please submit a PR?
@aibarbetta Are you proposing that there should be an option to include this flag? Expecting these resources to be removed first is a safe default.
@devoncrouse I'm not sure I agree that expecting these Pods to be removed first is a safe default. I attempted to drain the nodes of one of my node pools following the steps documented here; the apply finished successfully, but the drain did not (as you can see in the output under "Actual Behavior"). Had I moved on to the upgrade steps in the documentation, I would have terminated my Pods using local storage in a non-graceful way.
I think OKE should fail the apply if the drain fails. Then we can either add an option to include this flag in the drain, or improve the documentation to ask users to remove these Pods before continuing with the upgrade.
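If we go the documentation route, users would need a way to find these Pods before upgrading. One possible sketch (assumes jq is available):

```shell
# List Pods that use emptyDir volumes, i.e. what kubectl drain counts as
# "local storage" and refuses to evict by default.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(any(.spec.volumes[]?; has("emptyDir")))
      | .metadata.namespace + "/" + .metadata.name'
```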