
[bootstrap-AWS] plural destroy gets stuck

Open jaystary opened this issue 2 years ago • 4 comments

Summary

When running `plural destroy` on a Kubeflow cluster for teardown, it gets stuck on the networking resources.

The culprit seems to be an active load balancer; once it is gone, the VPC teardown, including the internet gateway, subnets, and network interfaces, can finish.

Reproduction

plural destroy

UI/UX Issue Screenshots

Additional Info about Your Environment

On AWS


Message from the maintainers:

Impacted by this bug? Give it a 👍. We factor engagement into prioritization.

jaystary avatar Mar 06 '22 18:03 jaystary

Did this happen on a Kubeflow cluster? I think istio's load balancers sometimes don't quite clean themselves up.

michaeljguarino avatar Mar 16 '22 22:03 michaeljguarino

Yes, this happened on a Kubeflow cluster. The load balancer I mentioned was an AWS resource; I deleted it manually through the console, and Plural was then able to tear down the rest and complete the job.

I suspect some teardown steps are out of sync (with Terraform?), which is why it gets stuck: Terraform tries to delete resources that are still blocked by others (e.g. the VPC blocked by the load balancer).
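To illustrate the ordering problem described above: a resource can only be deleted after everything blocking it is gone, but the istio-created load balancer is outside Terraform's graph, so Terraform never schedules its deletion and the VPC stays blocked. A minimal sketch with Python's standard-library `graphlib` (the resource names and dependency edges here are illustrative assumptions, not Terraform's actual graph):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical map: each resource lists what must be deleted
# *before* it can go away. The load balancer was created by an
# in-cluster controller, so Terraform doesn't know to delete it
# first, and the dependents below block forever.
blocked_by = {
    "vpc": {"internet-gateway", "subnets", "load-balancer"},
    "internet-gateway": {"load-balancer"},
    "subnets": {"network-interfaces"},
    "network-interfaces": {"load-balancer"},
    "load-balancer": set(),
}

# static_order() yields each resource only after its blockers:
# the load balancer must come out first, the VPC last.
deletion_order = list(TopologicalSorter(blocked_by).static_order())
print(deletion_order)
```

Deleting the load balancer manually, as described above, effectively performs the first step of this order by hand, after which the rest of the teardown can proceed.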

jaystary avatar Mar 17 '22 11:03 jaystary

This is indeed something we observed regularly when tearing down Kubeflow clusters on AWS. We investigated, but we're not sure of the specific root cause; it boils down to the "breathing" nature of a Kubeflow cluster. Some in-cluster components, like external-dns, external-lb-controller, istio-operator, you name it, create and manage AWS resources on the cluster's behalf, and these components don't clean up their "claimed" AWS resources properly when you do a `kubectl delete namespaces --all`. The load balancer is one example; we have observed other leaked resources, like ENIs, as well. So when you then run `terraform destroy`, those resources still exist, and because they are attached to Terraform-managed resources, Terraform blocks the deletion with a dependency error.
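A quick way to spot the leaked ENIs described above is to look for detached ("available") interfaces left behind in the cluster's VPC. A minimal sketch of that filter, hedged as follows: the dict shape mirrors the EC2 `DescribeNetworkInterfaces` response fields (`NetworkInterfaceId`, `VpcId`, `Status`), but the sample IDs are made up; in practice you'd fetch the list with boto3 or `aws ec2 describe-network-interfaces` and delete the hits before re-running the destroy:

```python
def find_leaked_enis(enis, vpc_id):
    """Return IDs of ENIs in the given VPC that are detached
    ("available"), i.e. likely leftovers from in-cluster
    controllers that were deleted without cleaning up."""
    return [
        eni["NetworkInterfaceId"]
        for eni in enis
        if eni["VpcId"] == vpc_id and eni["Status"] == "available"
    ]

# Made-up sample mimicking a DescribeNetworkInterfaces response.
sample = [
    {"NetworkInterfaceId": "eni-aaa", "VpcId": "vpc-123", "Status": "available"},
    {"NetworkInterfaceId": "eni-bbb", "VpcId": "vpc-123", "Status": "in-use"},
    {"NetworkInterfaceId": "eni-ccc", "VpcId": "vpc-456", "Status": "available"},
]
print(find_leaked_enis(sample, "vpc-123"))  # ['eni-aaa']
```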

rauerhans avatar Mar 17 '22 12:03 rauerhans

Yeah, it's something of an inherent limitation of Kubernetes. We've made some teardowns a bit smoother, e.g. with our standard nginx ingress controller, but with istio's heavy network usage it becomes a lot trickier. One thing we could theoretically do is improve the teardown of istio itself, which might help with this.
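One mitigation along the lines discussed in this thread is to delete `LoadBalancer`-type Services first and let the cloud controller reap the AWS load balancers before Terraform runs. A minimal sketch of picking those Services out of `kubectl get svc -A -o json`-style output, with the caveat that the sample objects below are made up (only the field names follow the Kubernetes Service schema):

```python
def loadbalancer_services(services):
    """From a kubectl-style Service list, pick the Services of type
    LoadBalancer. Deleting these first (and waiting for the cloud
    controller to deprovision the AWS LBs) avoids the stuck VPC
    teardown described in this issue."""
    return [
        (s["metadata"]["namespace"], s["metadata"]["name"])
        for s in services
        if s["spec"].get("type") == "LoadBalancer"
    ]

# Made-up Service objects; field layout follows the Service schema.
sample = [
    {"metadata": {"namespace": "istio-system", "name": "istio-ingressgateway"},
     "spec": {"type": "LoadBalancer"}},
    {"metadata": {"namespace": "kubeflow", "name": "ml-pipeline"},
     "spec": {"type": "ClusterIP"}},
]
print(loadbalancer_services(sample))  # [('istio-system', 'istio-ingressgateway')]
```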

michaeljguarino avatar Mar 17 '22 12:03 michaeljguarino