inspektor-gadget
Testing on EKS starts failing after some days because VPCs aren't removed
The creation of the EKS cluster for running the integration tests is failing with:
"Resource handler returned message: "The maximum number of VPCs has been reached. (Service: Ec2, Status Code: 400, Request ID: xxx)"
This is happening because the deletion of the cluster sometimes fails, leaking some resources, especially the VPC:
Resource handler returned message: "The subnet 'subnet-foo-bar' has dependencies and cannot be deleted. (Service: Ec2, Status Code: 400, Request ID: xxx (RequestToken: yyy HandlerErrorCode: InvalidRequest)
I can see there are a lot of piled-up VPCs.
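For reference, a quick way to check how close the account is to the limit (the quota code below is my assumption for "VPCs per Region"; verify it before relying on it):

```bash
# Count existing VPCs in the CI region and read the account's VPC quota.
# L-F678F1CE is assumed to be the "VPCs per Region" quota code; check the
# Service Quotas console if in doubt.
region="us-east-2"
aws ec2 --region "${region}" describe-vpcs | jq '.Vpcs | length'
aws service-quotas --region "${region}" get-service-quota \
  --service-code vpc --quota-code L-F678F1CE | jq '.Quota.Value'
```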
Possible Solutions
- Get this fixed upstream :crossed_fingers: (https://github.com/eksctl-io/eksctl/issues/7589)
- Use `--vpc-private-subnets` & `--vpc-public-subnets` to avoid creating a new VPC for the cluster (sketched below)
- Implement a job that automatically removes leaked VPCs
- Increase our VPC limit and clean it up manually every X days/weeks
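A minimal sketch of the subnets option, assuming a VPC is created once out of band (the subnet IDs and cluster name below are placeholders):

```bash
# Hypothetical invocation: pass subnets of a pre-created VPC so eksctl reuses
# them instead of creating a new VPC for every cluster.
eksctl create cluster \
  --name ig-ci-test-cluster \
  --region us-east-2 \
  --vpc-private-subnets subnet-0aaa,subnet-0bbb \
  --vpc-public-subnets subnet-0ccc,subnet-0ddd
```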
Some other possible workarounds, although this should really be dealt with upstream:
- They provide a """script""" to find all the resources attached to a VPC: https://repost.aws/knowledge-center/troubleshoot-dependency-error-delete-vpc
- Use `--force` (`--force  Force deletion to continue when errors occur`); see the sketch after this list
- Rather than `eksctl`, should we use `aws eks`? https://docs.aws.amazon.com/cli/latest/reference/eks/create-cluster.html https://docs.aws.amazon.com/cli/latest/reference/eks/delete-cluster.html https://github.com/aws/aws-cli
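For the `--force` workaround, the call would look something like this (the cluster name is a placeholder):

```bash
# Hedged sketch: tell eksctl to keep going when a deletion step fails,
# instead of aborting and leaking the remaining resources.
eksctl delete cluster \
  --region us-east-2 \
  --name ig-ci-test-cluster \
  --force
```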
I did the following for the time being:
- I opened #2534, perhaps it helps.
- Asked for the VPC limit to be increased to 200
- Developed this script that I'll be running from time to time to manually clean these things up:
```bash
#!/bin/bash
set -euxo pipefail
# This script tries to delete leaked VPCs and CloudFormation stacks of the Inspektor Gadget CI. It
# doesn't check if the resources are actually leaked, so if this is run while an integration test is
# running, it could break it.
region="us-east-2"
###### delete VPCs and their dependencies ######
# get all the vpc ids with tag ig-ci=true
VPCS=$(aws ec2 --region ${region} describe-vpcs --filter "Name=tag:ig-ci,Values=true" | jq -r '.Vpcs[].VpcId')
for vpc in $VPCS
do
echo "Deleting VPC $vpc"
# detach and delete gateways
igw=$(aws ec2 --region ${region} describe-internet-gateways --filters Name=attachment.vpc-id,Values=${vpc} | jq -r .InternetGateways[].InternetGatewayId)
if [ "${igw}" != "null" ]; then
for gw in ${igw}; do
echo "Detaching internet gateway ${gw}"
aws ec2 --region ${region} detach-internet-gateway --internet-gateway-id ${gw} --vpc-id ${vpc}
echo "Deleting internet gateway ${gw}"
aws ec2 --region ${region} delete-internet-gateway --internet-gateway-id ${gw}
done
fi
# delete network interfaces
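# Note (added remark): delete-network-interface fails for interfaces that are
# still attached to an instance; for leaked CI clusters the instances are
# usually gone already, so deleting directly generally works here.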
subnets=$(aws ec2 --region ${region} describe-subnets --filters Name=vpc-id,Values=${vpc} | jq -r .Subnets[].SubnetId)
if [ "${subnets}" != "null" ]; then
for subnet in ${subnets}; do
echo "Deleting network interfgaces in subnet ${subnet}"
# get network interfaces
network_interfaces=$(aws ec2 --region ${region} describe-network-interfaces --filters Name=subnet-id,Values=${subnet} | jq -r .NetworkInterfaces[].NetworkInterfaceId)
if [ "${network_interfaces}" != "null" ]; then
for ni in ${network_interfaces}; do
echo "Deleting network interface ${ni}"
aws ec2 --region ${region} delete-network-interface --network-interface-id ${ni}
done
fi
done
fi
# delete security groups
security_groups=$(aws ec2 --region ${region} \
describe-security-groups --filters Name=vpc-id,Values=${vpc} | jq -r .SecurityGroups[].GroupId)
if [ "${security_groups}" != "null" ]; then
for sg in ${security_groups}; do
# get security group name
sg_name=$(aws ec2 --region ${region} describe-security-groups --group-ids ${sg} | jq -r .SecurityGroups[].GroupName)
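# the default security group can't be deleted, so skip it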
if [ "${sg_name}" == "default" ]; then
continue
fi
echo "Deleting security group ${sg}"
aws ec2 --region ${region} delete-security-group --group-id ${sg}
done
fi
# delete subnets
subnets=$(aws ec2 --region ${region} describe-subnets --filters Name=vpc-id,Values=${vpc} | jq -r .Subnets[].SubnetId)
if [ "${subnets}" != "null" ]; then
for subnet in ${subnets}; do
echo "Deleting subnet ${subnet}"
aws ec2 --region ${region} delete-subnet --subnet-id ${subnet}
done
fi
aws ec2 --region ${region} delete-vpc --vpc-id ${vpc}
done
###### delete CloudFormation stacks ######
stack_names=$(aws --region ${region} cloudformation describe-stacks --query 'Stacks[?Tags[?Key == `ig-ci`]]' | jq -r '.[].StackName')
for stack in ${stack_names}
do
echo "Deleting stack ${stack}"
aws --region ${region} cloudformation delete-stack --stack-name ${stack}
done
```
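Since the script warns that it could break a running test, a pre-flight check along these lines could make it safer (my sketch, assuming GNU `date`; the 6-hour window is arbitrary):

```bash
# Hypothetical pre-flight check (not part of the script above): abort the
# cleanup if any ig-ci CloudFormation stack was created recently, since a
# running integration test may still be using its resources.
region="us-east-2"
cutoff=$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%S)   # GNU date
recent=$(aws --region "${region}" cloudformation describe-stacks \
  | jq -r --arg cutoff "${cutoff}" \
    '.Stacks[] | select((.Tags // []) | any(.Key == "ig-ci"))
               | select(.CreationTime > $cutoff) | .StackName')
if [ -n "${recent}" ]; then
  echo "Recently created ig-ci stacks found, skipping cleanup: ${recent}"
  exit 0
fi
```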
> Rather than `eksctl`, should we use `aws eks`?
I tried that but then you need to also create the VPCs, gateways, node groups and everything else, so in the end it's much more complicated than using eksctl.
This looks painful:
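A rough sketch of what that route involves (my assumptions: all names, ARNs and IDs below are placeholders):

```bash
# Hypothetical sketch of the aws-cli route: the VPC, subnets, IAM roles and
# security groups all have to exist (or be created) beforehand.
aws eks create-cluster \
  --region us-east-2 \
  --name ig-ci-test-cluster \
  --role-arn arn:aws:iam::123456789012:role/eks-cluster-role \
  --resources-vpc-config subnetIds=subnet-0aaa,subnet-0bbb,securityGroupIds=sg-0ccc
# ...and the node group only once the cluster is ACTIVE:
aws eks create-nodegroup \
  --region us-east-2 \
  --cluster-name ig-ci-test-cluster \
  --nodegroup-name ig-ci-nodes \
  --node-role arn:aws:iam::123456789012:role/eks-node-role \
  --subnets subnet-0aaa subnet-0bbb
```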
> Use `--vpc-private-subnets` & `--vpc-public-subnets` to avoid creating a new VPC for the cluster
Why don't we go for this option? I mean, we will have to create some resources (VPC, gateways, subnets), but that would be a one-time thing?
> Why don't we go for this option? I mean, we will have to create some resources (VPC, gateways, subnets), but that would be a one-time thing?
Besides me not wanting to go through all the pain of setting that up again:
- I wanted to have clusters as independent as possible
- It'd be a new dependency for running the CI, so forks would have to set it up in order to run integration tests
Perhaps I'm being too optimistic that it'll be fixed upstream soon and that we can handle some manual cleanup in the meantime.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.
I think this is fixed by #2686. Would you mind confirming @burak-ok?
> I think this is fixed by #2686. Would you mind confirming @burak-ok?
Correct, this is fixed.