Improve resource cleanup

Open g-gaston opened this issue 3 years ago • 10 comments

When a cluster creation fails, the process of cleaning up resources is manual, cumbersome, and prone to errors. We could automate most of this to improve the user experience when debugging. We already have a --force-cleanup flag, but it doesn't do much. Think about everything you need to do after a failed cluster creation before you can run the CLI again; that's what we should try to add to this flow. Examples:

  • Delete the bootstrap cluster. We do that today, but it's not very robust. Find the cases where it doesn't work and fix them.
  • If we don't support more than one kind cluster running at a time, even when the other one isn't an EKS-A cluster, add a validation for this and give instructions for deleting the extra clusters.
  • Clean up vSphere VMs if it's a vSphere cluster.
  • Clean up Docker resources if it's a Docker cluster.
  • Delete the <cluster-name> folder.
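
Until something like this lands in the CLI, two of the steps above can be sketched as a small shell helper. This is only an illustration, not what --force-cleanup does. It assumes the bootstrap kind cluster is named <cluster-name>-eks-a-cluster (inferred from the error output later in this thread) and that the <cluster-name> folder sits in the current directory. The function names and the DRY_RUN variable are made up for this example. vSphere VMs and Docker provider containers are not covered.

```shell
#!/bin/sh
# Sketch only: manual cleanup after a failed `eksctl anywhere create cluster`.
# Hypothetical helper names; naming convention and folder location are assumptions.

# run: print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# cleanup_failed_create <cluster-name>
# Deletes the assumed bootstrap kind cluster and the local <cluster-name> folder.
cleanup_failed_create() {
  cfc_name="${1:-}"
  # Refuse empty names and anything that could escape the current directory.
  case "$cfc_name" in
    ''|.|..|*/*)
      echo "usage: cleanup_failed_create <cluster-name>" >&2
      return 2
      ;;
  esac
  # Assumed naming convention: <cluster-name>-eks-a-cluster.
  run kind delete cluster --name "${cfc_name}-eks-a-cluster"
  # Assumed location: a folder named after the cluster in the working directory.
  run rm -rf -- "./${cfc_name}"
}
```

By default the helper only prints what it would run. Once the printed commands look right, calling it as DRY_RUN=0 cleanup_failed_create prod executes them.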

g-gaston avatar Sep 16 '21 15:09 g-gaston

@g-gaston Is this a duplicate of https://github.com/aws/eks-anywhere/issues/163?

vivek-koppuru avatar Sep 27 '21 23:09 vivek-koppuru

What are the manual cleanup steps for a failed cluster deployment? My administrative machine is stuck after a failed deployment to vSphere. I've already powered off and deleted the VMs manually from vSphere.

eksctl anywhere create cluster -f eksa-cluster.yaml
Error: failed to create cluster: error creating bootstrap cluster: error executing create cluster: ERROR: failed to create cluster: node(s) already exist for a cluster with the name "prod-eks-a-cluster" , try rerunning with --force-cleanup to force delete previously created bootstrap cluster

eksctl anywhere create cluster -f eksa-cluster.yaml --force-cleanup
Error: failed to create cluster: error deleting bootstrap cluster: management cluster in bootstrap cluster

jasonboche avatar Oct 14 '21 16:10 jasonboche

@jasonboche try:

kind delete cluster --name prod-eks-a-cluster
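
Before deleting anything, it may help to see which kind clusters on the machine look like leftover bootstrap clusters. The sketch below is not from the maintainers. It relies on kind get clusters, which prints one cluster name per line, and it assumes bootstrap clusters end in -eks-a-cluster, as in the error output above. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: find kind clusters that look like EKS Anywhere bootstrap clusters.
# The "-eks-a-cluster" suffix is an assumption based on the error output in this thread.

# Read cluster names (one per line) on stdin and keep those with the suffix.
# `|| true` keeps the exit status at 0 when nothing matches.
filter_bootstrap_clusters() {
  grep -e '-eks-a-cluster$' || true
}

# List candidate bootstrap clusters known to kind on this machine.
list_bootstrap_clusters() {
  kind get clusters 2>/dev/null | filter_bootstrap_clusters
}
```

Each name printed by list_bootstrap_clusters can then be passed to kind delete cluster --name.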

g-gaston avatar Oct 22 '21 15:10 g-gaston

Thank you kindly. I'll give that a try!

jasonboche avatar Oct 25 '21 17:10 jasonboche

The kind delete cluster command didn't work for me :/

ataince avatar Mar 25 '22 10:03 ataince

@ataince What output did you get?

chrisdoherty4 avatar Mar 25 '22 13:03 chrisdoherty4

I think it was about memory. I increased it, but now it's stuck on this step:

⏳ Collecting support bundle from cluster, this can take a while {"cluster": "dev-cluster", "bundle": "dev-cluster/generated/dev-cluster-2022-03-25T12:40:36Z-bundle.yaml", "since": 1648208436523389598, "kubeconfig": "dev-cluster/dev-cluster-eks-a-cluster.kubeconfig"}

ataince avatar Mar 25 '22 13:03 ataince

@ataince I'm sorry to hear that. I wrapped up this project before I was able to get to the bottom of this. I just learned the many traps to avoid so that I didn't get stuck and have to re-deploy all over again. I'll probably end up revisiting this project within the next year and my hope is by then AWS will have put in much better error trapping and clear cleanup steps that actually work. This isn't a total knock on AWS. I realize this was relatively new and uncharted territory and these types of issues go with the territory until things mature.

Jas

jasonboche avatar Mar 28 '22 21:03 jasonboche

Bumping up the priority on this one, as it's a pretty important issue that has come up multiple times, especially for cleaning up the local bootstrap cluster and the cluster-name folder.

abhinavmpandey08 avatar Jul 08 '22 21:07 abhinavmpandey08

Adding my voice. This needs to be prioritized. While figuring out how to get everything working you tear through quite a few clusters. A quick and thorough cleanup is a must.

AndreasDavour avatar Nov 14 '22 13:11 AndreasDavour