plural-artifacts
[bootstrap-AWS] plural destroy gets stuck
Summary
When running plural destroy to tear down a Kubeflow cluster, the command gets stuck on the networking resources.
The culprit seems to be an active load balancer. Once it is gone, the VPC teardown (including the internet gateway, subnets, and network interfaces) can finish.
Reproduction
plural destroy
UI/UX Issue Screenshots
Additional Info about Your Environment
On AWS
Did this happen for a Kubeflow cluster? I think sometimes Istio's load balancers don't quite clean themselves up.
Yes, this happened for a Kubeflow cluster. The load balancer I mentioned was an AWS resource. I went in and deleted it manually through the console, and then Plural was able to tear down the rest and complete the job.
I suspect some teardown commands are out of sync (with TF?), which is why it gets stuck. TF probably tries to delete resources that are still blocked by others (e.g. the VPC blocked by the load balancer).
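To check for this kind of blocker before retrying the destroy, something like the following could help. It is only a minimal sketch: the function name blocking_load_balancers is made up for illustration, and it assumes you pass in the raw responses of the AWS describe-load-balancers calls for both ELBv2 (ALB/NLB) and classic ELB. It only lists candidates and deletes nothing.

```python
def blocking_load_balancers(vpc_id, elbv2_response, elb_response):
    """Return the names of load balancers that live in the given VPC.

    elbv2_response: response dict of the ELBv2 describe-load-balancers call
    elb_response:   response dict of the classic ELB describe-load-balancers call
    (hypothetical helper, not part of Plural or Terraform)
    """
    names = []
    # ELBv2 (ALB/NLB) entries carry the VPC ID under "VpcId".
    for lb in elbv2_response.get("LoadBalancers", []):
        if lb.get("VpcId") == vpc_id:
            names.append(lb["LoadBalancerName"])
    # Classic ELB entries carry the VPC ID under "VPCId".
    for lb in elb_response.get("LoadBalancerDescriptions", []):
        if lb.get("VPCId") == vpc_id:
            names.append(lb["LoadBalancerName"])
    return names
```

With boto3 installed, the two responses could come from boto3.client("elbv2").describe_load_balancers() and boto3.client("elb").describe_load_balancers(). Accounts with many load balancers would need pagination, which this sketch ignores.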
This is indeed something that we have observed regularly when tearing down Kubeflow clusters on AWS. We investigated, but we're not sure about the specific root cause. It boils down to the "breathing" nature of the Kubeflow cluster. Some in-cluster components (external-dns, external-lb-controller, istio-operator, you name it) create and manage AWS resources by extension, and these components don't clean up their "claimed" AWS resources properly when you do a kubectl delete namespaces --all. The load balancer is one example; we have observed other leftover resources, such as ENIs, as well. So, in the end, when you run terraform destroy, they still exist, and because they are attached to Terraform-managed resources, Terraform blocks the deletion with a dependency error.
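One possible way to see which in-cluster objects are likely to own such AWS load balancers is to list all Services of type LoadBalancer before deleting the namespaces. The sketch below is an illustration only: load_balancer_services is a hypothetical helper, and it assumes its input is the JSON printed by kubectl get services --all-namespaces -o json. Deleting those Services first and waiting for them to disappear may give the controllers a chance to release their load balancers, but whether that is enough for Istio is an assumption, and it does not cover other leftovers such as ENIs.

```python
import json


def load_balancer_services(kubectl_json):
    """Return (namespace, name) pairs for Services of type LoadBalancer.

    kubectl_json: the JSON text printed by
    `kubectl get services --all-namespaces -o json`
    (hypothetical helper for illustration)
    """
    doc = json.loads(kubectl_json)
    return [
        (item["metadata"]["namespace"], item["metadata"]["name"])
        for item in doc.get("items", [])
        # Only Services of this type ask the cloud provider for a load balancer.
        if item.get("spec", {}).get("type") == "LoadBalancer"
    ]
```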
Yeah, it's something of an inherent limitation of Kubernetes. We've made some teardowns a bit smoother, like with our standard nginx ingress controller, but I think with Istio's extreme network usage it becomes a lot more tricky. One thing we could theoretically do is find a way to improve the teardown of Istio itself to potentially help with this.