eks-anywhere
Improve resource cleanup
When a cluster creation fails, the process of cleaning up resources is manual, cumbersome, and prone to errors. We could automate most of this to improve the user experience when debugging. We already have a --force-cleanup flag; it just doesn't do a lot. Think about everything you need to do after a cluster creation fails, before running the CLI again: that's what we should try to add to this flow. Examples:
- Delete the bootstrap cluster. We do that today, but it's not super robust. Find when it doesn't work and fix it.
- If we don't support more than one kind cluster running, even if it's not an eks-a one, add a validation for this and give instructions to delete the extra ones.
- Clean up vSphere VMs if it's a vSphere cluster.
- Clean up Docker resources if it's a Docker cluster.
- Delete the <cluster-name> folder (a rough sketch of these manual steps follows this list).
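For reference, here is roughly what that manual cleanup looks like today on a vSphere or Docker cluster. This is not a supported flow, just a sketch: the cluster name mgmt, the use of the govc CLI, and the exact VM and container names are assumptions and will differ per environment.

# delete the leftover bootstrap (kind) cluster
kind delete cluster --name mgmt

# vSphere provider: locate and destroy the VMs the failed create left behind (assumes a configured govc CLI)
govc find / -type m -name 'mgmt-*'
govc vm.destroy '<vm path from the find output>'

# Docker provider: remove the node containers created for the cluster
docker ps -a | grep mgmt
docker rm -f <container-name>

# remove the generated <cluster-name> folder before re-running the CLI
rm -rf ./mgmt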
@g-gaston Is this a duplicate of https://github.com/aws/eks-anywhere/issues/163?
What are the manual cleanup steps for a failed cluster deployment? My administrative machine is stuck after a failed deployment to vSphere. I've already powered off and deleted the VMs manually from vSphere.
eksctl anywhere create cluster -f eksa-cluster.yaml
Error: failed to create cluster: error creating bootstrap cluster: error executing create cluster: ERROR: failed to create cluster: node(s) already exist for a cluster with the name "prod-eks-a-cluster", try rerunning with --force-cleanup to force delete previously created bootstrap cluster

eksctl anywhere create cluster -f eksa-cluster.yaml --force-cleanup
Error: failed to create cluster: error deleting bootstrap cluster: management cluster in bootstrap cluster
@jasonboche try:
kind delete cluster --name prod-eks-a-cluster
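If it's not obvious which kind cluster was left behind, it can be listed first, and the deletion confirmed by checking that no kind node containers remain on the admin machine. A quick sketch, assuming the kind and docker CLIs are installed:

kind get clusters
# the bootstrap cluster's node containers run the kindest/node image
docker ps -a | grep kindest/node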
Thank you kindly. I'll give that a try!
It didn't work for me :/
@ataince What output did you get?
I think it was about the memory. I increased it, but now it's stuck on this step.
⏳ Collecting support bundle from cluster, this can take a while {"cluster": "dev-cluster", "bundle": "dev-cluster/generated/dev-cluster-2022-03-25T12:40:36Z-bundle.yaml", "since": 1648208436523389598, "kubeconfig": "dev-cluster/dev-cluster-eks-a-cluster.kubeconfig"}
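While the CLI sits on that step, the cluster being bundled can be inspected directly with the kubeconfig printed in the log line above. A hedged sketch, assuming the create got far enough to write that kubeconfig; the path is taken from the log and will differ per cluster name:

kubectl --kubeconfig dev-cluster/dev-cluster-eks-a-cluster.kubeconfig get nodes
kubectl --kubeconfig dev-cluster/dev-cluster-eks-a-cluster.kubeconfig get pods -A
# the local bootstrap (kind) containers are also still visible on the admin machine
docker ps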
@ataince I'm sorry to hear that. I wrapped up this project before I was able to get to the bottom of this. I just learned the many traps to avoid so that I didn't get stuck and have to re-deploy all over again. I'll probably end up revisiting this project within the next year and my hope is by then AWS will have put in much better error trapping and clear cleanup steps that actually work. This isn't a total knock on AWS. I realize this was relatively new and uncharted territory and these types of issues go with the territory until things mature.
Jas
Bumping up the priority on this one, as it's a pretty important issue that has come up multiple times, especially for cleaning up the local bootstrap cluster and the <cluster-name> folder.
Adding my voice. This needs to be prioritized. While figuring out how to get everything working, you tear through quite a few clusters. A quick and thorough cleanup is a must.