eks-blueprints: Network resources (VPC, Subnet, Routetable, NACL, SG) are left behind when destroying the whole stack
Describe the bug
When calling cdk destroy, all or most of the networking resources were left undestroyed.
Expected Behavior
All the resources are destroyed.
Current Behavior
Resources such as the VPC, Subnet, Routetable, NACL, IGW, NetworkInterfaces, and SG were left behind. I had to find a Gist to identify them and then destroy them manually.
Reproduction Steps
Added a list of Addons
// AddOns for the cluster.
const addOns: Array<blueprints.ClusterAddOn> = [
  new blueprints.addons.FluxCDAddOn,
  new blueprints.addons.SSMAgentAddOn,
  new blueprints.addons.ClusterAutoScalerAddOn,
  new blueprints.addons.AwsLoadBalancerControllerAddOn(),
  //new blueprints.addons.VpcCniAddOn(),
  new blueprints.addons.CertManagerAddOn(),
  new blueprints.addons.ExternalDnsAddOn({
    hostedZoneResources: [blueprints.GlobalResources.HostedZone]
  }),
  new blueprints.addons.EfsCsiDriverAddOn({kmsKeys: [kmsKey]}),
  new blueprints.addons.EbsCsiDriverAddOn(),
  new blueprints.addons.IngressNginxAddOn()
];
Then created the cluster:
const stack = blueprints.EksBlueprint.builder()
  .version('auto')
  .account(account)
  .region(region)
  .clusterProvider(clusterProvider)
  .resourceProvider(blueprints.GlobalResources.Vpc, new blueprints.VpcProvider(undefined, { primaryCidr: envContext.vpcCidr }))
  .resourceProvider(blueprints.GlobalResources.HostedZone, new blueprints.ImportHostedZoneProvider(r53HostedZone.hostedZoneId, hostedZoneName))
  .resourceProvider(blueprints.GlobalResources.KmsKey, new blueprints.CreateKmsKeyProvider())
  .resourceProvider("s3-bucket", new blueprints.CreateS3BucketProvider({
    name: envContext.s3BucketName+'.'+account+'.'+region,
    id: envContext.s3BucketName,
    s3BucketProps: { removalPolicy: RemovalPolicy.DESTROY },
  }))
  .addOns(...addOns)
  .build(this, 'my-eks-blueprint');
Possible Solution
No response
Additional Information/Context
No response
CDK CLI Version
2.147.3 (build 32f0fdb)
EKS Blueprints Version
1.15.1
Node.js Version
v22.2.0
Environment details (OS name and version, etc.)
sw_vers ProductName: macOS ProductVersion: 14.5 BuildVersion: 23F79
Other information
No response
@jesperalmstrom what is the status of the stack in CloudFormation after you destroy it? Sometimes the destroy command won't finish and leaves resources behind; that happens, for example, if any of the resources are modified outside of the stack. In that case CFN detects drift and stops.
I did a faulty deploy (wrong region), so I destroyed it almost immediately. There should not have been any drift.
The CFN stack got several of these:
This resource failed to delete. It was skipped and retained using the Force Delete Stack mode.
When I tried force delete, it would not succeed because of dependencies. It took me hours to find and understand all the dependencies. I found a script that I modified to be able to delete them (a slightly modified version of this: https://gist.github.com/alberto-morales/b6d7719763f483185db27289d51f8ec5).
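For readers facing the same cleanup, the dependency problem described above can be sketched as an ordering question: a VPC cannot be deleted while subnets or an attached internet gateway remain, and subnets and security groups cannot be deleted while network interfaces still use them. The snippet below is a minimal illustration only. The `DELETE_ORDER` list, the `Leftover` type, and the resource IDs are hypothetical and are not taken from the linked Gist or from any AWS API.

```typescript
// Assumed teardown order for leftover VPC resources: kinds listed earlier
// are deleted first. This is an illustration, not an official AWS ordering.
const DELETE_ORDER = [
  "load-balancer",
  "nat-gateway",
  "network-interface",
  "security-group",
  "subnet",
  "route-table",
  "network-acl",
  "internet-gateway",
  "vpc",
] as const;

type LeftoverKind = (typeof DELETE_ORDER)[number];

// Hypothetical record describing one resource that survived the stack delete.
interface Leftover {
  kind: LeftoverKind;
  id: string;
}

// Returns a sorted copy; the input array is left untouched.
function inDeleteOrder(items: Leftover[]): Leftover[] {
  return items
    .slice()
    .sort((a, b) => DELETE_ORDER.indexOf(a.kind) - DELETE_ORDER.indexOf(b.kind));
}

// Placeholder IDs, for illustration only.
const leftovers: Leftover[] = [
  { kind: "vpc", id: "vpc-example" },
  { kind: "subnet", id: "subnet-example" },
  { kind: "internet-gateway", id: "igw-example" },
  { kind: "network-interface", id: "eni-example" },
  { kind: "security-group", id: "sg-example" },
];

for (const item of inDeleteOrder(leftovers)) {
  console.log(`${item.kind}: ${item.id}`);
}
```

Real cleanups usually need extra steps that this ordering does not capture, such as revoking rules between security groups that reference each other and detaching the internet gateway before deleting it.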
@shapirov103 do you have any ideas or tricks?
This is not the expected behavior; however, before qualifying it as a defect, please share which dependencies you discovered. For example, I see that you used Flux. Flux can in turn provision apps that will fail to be removed if the Flux controller is destroyed; this is especially true for any CRDs from Flux, which then fail to be cleaned up because no controller is available.
CFN is expected to remove all resources that were provisioned, provided there was no change to the resources.
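One way to answer the question about which resources were retained is to look at the per-resource status that CloudFormation reports after the delete attempt. The sketch below filters a resource list for the `DELETE_FAILED` and `DELETE_SKIPPED` statuses. The interface mirrors only a few fields of the entries returned by the ListStackResources API, and the sample data is hypothetical; fetching the list (for example with `aws cloudformation list-stack-resources`) is left out.

```typescript
// Subset of the fields CloudFormation returns for each stack resource.
interface StackResourceSummary {
  LogicalResourceId: string;
  PhysicalResourceId?: string;
  ResourceType: string;
  ResourceStatus: string;
}

// Keeps only the resources that may still exist after the delete ended.
function findLeftBehind(resources: StackResourceSummary[]): StackResourceSummary[] {
  return resources.filter(
    (r) => r.ResourceStatus === "DELETE_FAILED" || r.ResourceStatus === "DELETE_SKIPPED"
  );
}

// Hypothetical sample data; logical and physical IDs are placeholders.
const sampleResources: StackResourceSummary[] = [
  { LogicalResourceId: "ExampleVpc", PhysicalResourceId: "vpc-example", ResourceType: "AWS::EC2::VPC", ResourceStatus: "DELETE_FAILED" },
  { LogicalResourceId: "ExampleSubnet", PhysicalResourceId: "subnet-example", ResourceType: "AWS::EC2::Subnet", ResourceStatus: "DELETE_SKIPPED" },
  { LogicalResourceId: "ExampleRole", ResourceType: "AWS::IAM::Role", ResourceStatus: "DELETE_COMPLETE" },
];

for (const r of findLeftBehind(sampleResources)) {
  console.log(`${r.ResourceType} ${r.LogicalResourceId} (${r.PhysicalResourceId ?? "no physical id"}): ${r.ResourceStatus}`);
}
```

Sharing this filtered list in the issue, together with the status reason shown for each entry, makes it easier to tell a dependency problem from a custom resource timeout.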
Thanks for the response. I will try to remove Flux and see if the delete goes more smoothly.
It's mainly VPC network-related resources that are left behind when I try to remove the stack.
I have a similar situation. I created a cluster with the 1.17.2 blueprints. After the stack was created and the cluster was stable,
nothing was added to the cluster beyond what the CDK stack deployed.
I decided to delete the CloudFormation stack. After an hour, it failed with this error: CloudFormation did not receive a response from your Custom Resource. Please check your logs for requestId [f397410c-726d-4886-9a67-00c57eda7759]. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version.
We've seen this multiple times. I wonder if there is a recommended approach for destroying EKS Blueprint CDK stacks.
@vpopiolrccl are there any resources that are automatically created outside of the CDK stack? For example, in order to deal with PVRE, some customers automatically add permissions to the cluster roles based on a trigger (e.g. an EKS cluster is created). In this case CDK and CFN fail when they detect any resource modification outside of the stack. Another example could be provisioning an ingress or load balancer in EKS (e.g. using GitOps) and then failing to clean it up. That may hold VPC resources, as the load balancer has network interfaces in the subnets. We routinely test that our stacks are created and destroyed successfully, so we are trying to identify the difference here.
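To check whether something outside the stack is holding the subnets, it can help to look at the network interfaces that remain in them and guess who created each one from its description. The sketch below assumes description prefixes that are commonly seen in practice ("ELB " for load balancers, "Amazon EKS " for the cluster control plane, "aws-K8S-" for the VPC CNI); these are observed conventions rather than a documented contract, and the `guessOwner` helper and sample data are hypothetical. The input would come from something like `aws ec2 describe-network-interfaces` filtered by VPC.

```typescript
// Subset of the fields EC2 returns for each network interface.
interface NetworkInterfaceInfo {
  NetworkInterfaceId: string;
  SubnetId: string;
  Description: string;
}

type LikelyOwner = "load-balancer" | "eks-control-plane" | "vpc-cni" | "unknown";

// Guesses the creator from the description prefix (assumed conventions).
function guessOwner(eni: NetworkInterfaceInfo): LikelyOwner {
  const d = eni.Description;
  if (d.indexOf("ELB ") === 0) return "load-balancer";
  if (d.indexOf("Amazon EKS ") === 0) return "eks-control-plane";
  if (d.indexOf("aws-K8S-") === 0) return "vpc-cni";
  return "unknown";
}

// Hypothetical sample data; all IDs and names are placeholders.
const remainingEnis: NetworkInterfaceInfo[] = [
  { NetworkInterfaceId: "eni-example1", SubnetId: "subnet-example", Description: "ELB app/example-alb/0123456789abcdef" },
  { NetworkInterfaceId: "eni-example2", SubnetId: "subnet-example", Description: "Amazon EKS example-cluster" },
  { NetworkInterfaceId: "eni-example3", SubnetId: "subnet-example", Description: "" },
];

for (const eni of remainingEnis) {
  console.log(`${eni.SubnetId} ${eni.NetworkInterfaceId}: ${guessOwner(eni)}`);
}
```

An interface classified as "load-balancer" after the cluster is gone would point to a load balancer created from inside the cluster, which matches the ingress scenario described above.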
@zjaco13 ^^