terraform-oci-oke
drain_nodes unable to drain Pods with local storage
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Terraform Version and Provider Version
Using OKE v4.2.6 and the following providers:
% terraform -v
Terraform v1.2.8
on darwin_amd64
+ provider registry.terraform.io/hashicorp/cloudinit v2.2.0
+ provider registry.terraform.io/hashicorp/local v2.2.3
+ provider registry.terraform.io/hashicorp/null v3.1.1
+ provider registry.terraform.io/hashicorp/time v0.8.0
+ provider registry.terraform.io/hashicorp/tls v4.0.2
+ provider registry.terraform.io/oracle/oci v4.87.0
Affected Resource(s)
module.oke.module.extensions.null_resource.drain_nodes
Terraform Configuration Files
operator_state = "RUNNING"
upgrade_nodepool = true
node_pools_to_drain = ["main_node_pool_1_23"]
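For context, these are module inputs; an assumed workflow (the tfvars file name is illustrative, not from the module docs) would be:

```shell
# Assumed workflow: put the variables above in a tfvars file and apply;
# the drain then runs from the operator host via remote-exec (per the logs below).
terraform apply -var-file="upgrade.tfvars"
```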
Debug Output
Panic Output
There was no panic; the Terraform apply finished successfully:
...
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): 10.47.16.22 drained
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0]: Creation complete after 2m2s [id=6046740775330308523]
Apply complete! Resources: 2 added, 0 changed, 1 destroyed.
Expected Behavior
The drain should finish successfully even when a node hosts Pods using local storage; adding the --delete-emptydir-data flag to the drain scripts would accomplish this.
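I haven't verified the module's exact drain invocation, but conceptually the change is a single flag. A minimal sketch, assuming the drain script loops over node IPs and calls kubectl drain (NODE and the other flags shown are placeholders for illustration, not the module's actual arguments):

```shell
# Sketch only: evict Pods even when they use emptyDir (local) storage.
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --timeout=900s \
  --delete-emptydir-data   # the proposed addition
```

With that flag, every node should drain cleanly, as in the successful case below: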
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): node/10.47.16.75 cordoned
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): WARNING: ignoring DaemonSet-managed Pods: contour-system-private/envoy-private-zb4q2, kube-system/csi-oci-node-j8dgn, kube-system/kube-flannel-ds-s8m7b, kube-system/kube-proxy-lwxrk, kube-system/proxymux-client-tnmnh
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod kube-system/coredns-845c966fb4-4684b
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod contour-system-private/contour-certgen-v1.19.0-private-nq7g9
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod cert-manager/cert-manager-webhook-8b876c7db-jh4pg
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod default/nginx-5b75b4c66b-8lk4n
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): evicting pod default/nginx-5b75b4c66b-pfv5v
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/contour-certgen-v1.19.0-private-nq7g9 evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/cert-manager-webhook-8b876c7db-jh4pg evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/nginx-5b75b4c66b-8lk4n evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/nginx-5b75b4c66b-pfv5v evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): pod/coredns-845c966fb4-4684b evicted
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): node/10.47.16.75 drained
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): 10.47.16.75 drained
Actual Behavior
The drain did not complete on nodes hosting Pods using local storage:
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): node/10.47.30.170 cordoned
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): error: unable to drain node "10.47.30.170" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/sealed-secrets-controller-66848bcc4f-cw8b2, continuing command...
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): There are pending nodes to be drained:
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): 10.47.30.170
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): cannot delete Pods with local storage (use --delete-emptydir-data to override): kube-system/sealed-secrets-controller-66848bcc4f-cw8b2
module.kubernetes.module.oke.module.extensions.null_resource.drain_nodes[0] (remote-exec): 10.47.30.170 drained
Steps to Reproduce
- Deploy a Pod that uses local storage (for example, kubernetes-dashboard); a minimal example is sketched after this list
- Set upgrade_nodepool = true and add a node pool to the node_pools_to_drain list
- Run terraform apply
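For the first step, here is a hypothetical minimal Pod that triggers the failure; any workload with an emptyDir volume behaves the same way, since kubectl drain treats emptyDir as local storage:

```shell
# Hypothetical reproduction: the emptyDir volume makes kubectl drain refuse
# eviction unless --delete-emptydir-data is passed.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: local-storage-demo
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      emptyDir: {}
EOF
```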
Important Factoids
References
Hi @aibarbetta,
Thanks for bringing this to our attention. Looks like you've already figured out the solution :) Can you please submit a PR?
@aibarbetta Are you proposing that there should be an option to include this flag? Expecting these resources to be removed first is a safe default.
@devoncrouse I'm not sure I agree that expecting these Pods to be removed first is a safe default. I attempted to drain the nodes of one of my node pools following the steps documented here; the apply finished successfully, but the drain did not (as you can see in the output under "Actual Behavior"). Had I moved on to the upgrade steps in the documentation, I would have terminated my Pods using local storage in a non-graceful way.
I think OKE should fail the apply if the drain fails. Then we can either add an option to include this flag in the drain, or improve the documentation to ask users to remove these Pods before continuing with the upgrade.
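If we go the documentation route, users would need a way to find these Pods before upgrading. One possible sketch (assumes jq is available):

```shell
# List Pods that use emptyDir volumes, i.e. what kubectl drain counts as
# "local storage" and refuses to evict by default.
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(any(.spec.volumes[]?; has("emptyDir")))
      | .metadata.namespace + "/" + .metadata.name'
```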