
Testing on EKS starts failing after some days because VPCs aren't removed

Open mauriciovasquezbernal opened this issue 1 year ago • 4 comments

The creation of the EKS cluster for running the integration tests is failing with:

"Resource handler returned message: "The maximum number of VPCs has been reached. (Service: Ec2, Status Code: 400, Request ID: xxx)"

This is happening because the deletion of the cluster sometimes fails, leaking some resources, especially the VPC:

Resource handler returned message: "The subnet 'subnet-foo-bar' has dependencies and cannot be deleted. (Service: Ec2, Status Code: 400, Request ID: xxx (RequestToken: yyy HandlerErrorCode: InvalidRequest)

I can see there are a lot of piled-up VPCs:

(screenshot: AWS console listing the leaked VPCs)

Possible Solutions

  • Get this fixed upstream :crossed_fingers: (https://github.com/eksctl-io/eksctl/issues/7589)
  • Use --vpc-private-subnets & --vpc-public-subnets to avoid creating a new VPC for the cluster
  • Implement a job that automatically removes leaked VPCs
  • Increase our VPC limit and clean it up manually each X days/weeks
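For the second option, instead of the `--vpc-private-subnets`/`--vpc-public-subnets` flags, eksctl can also take pre-created subnets from a ClusterConfig file. A minimal sketch, assuming the VPC and subnets were created once by hand (all names and subnet IDs below are hypothetical):

```yaml
# Hypothetical eksctl ClusterConfig that reuses an existing VPC; when subnet
# IDs are given, eksctl skips creating a new VPC for the cluster.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ig-ci-cluster   # hypothetical cluster name
  region: us-east-2
vpc:
  subnets:
    private:
      us-east-2a: { id: subnet-0aaaaaaaaaaaaaaaa }
      us-east-2b: { id: subnet-0bbbbbbbbbbbbbbbb }
    public:
      us-east-2a: { id: subnet-0cccccccccccccccc }
      us-east-2b: { id: subnet-0dddddddddddddddd }
```

Cluster creation would then be `eksctl create cluster -f cluster.yaml`, with the VPC lifecycle managed outside the CI entirely.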

mauriciovasquezbernal avatar Feb 22 '24 16:02 mauriciovasquezbernal

Some other possible workarounds, but they should really deal with it upstream:

  1. They provide a """script""" to find all the resources attached to a VPC: https://repost.aws/knowledge-center/troubleshoot-dependency-error-delete-vpc
  2. Use --force ("Force deletion to continue when errors occur")
  3. Rather than eksctl, should we use aws eks? https://docs.aws.amazon.com/cli/latest/reference/eks/create-cluster.html https://docs.aws.amazon.com/cli/latest/reference/eks/delete-cluster.html https://github.com/aws/aws-cli
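For workaround 2, the only change would be the teardown invocation. A sketch with eksctl stubbed out as a shell function so the snippet runs anywhere (the cluster name is hypothetical; the --force flag is real):

```shell
#!/bin/bash
set -euo pipefail

# Stub: in CI this would be the real eksctl binary; here it only echoes the call.
eksctl() { echo "would run: eksctl $*"; }

# --force tells eksctl to keep going (and still remove the CloudFormation
# stack) even when deleting some sub-resource fails.
eksctl delete cluster --name ig-ci-cluster --region us-east-2 --force
# prints: would run: eksctl delete cluster --name ig-ci-cluster --region us-east-2 --force
```

The trade-off is that --force can leave the same orphaned sub-resources behind; it just stops them from blocking the rest of the teardown.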

eiffel-fl avatar Feb 23 '24 03:02 eiffel-fl

I did the following for the time being:

  • I opened #2534, perhaps it helps.
  • Asked for updating the VPC limit to 200
  • Developed this script, which I'll run from time to time to manually clean these things up:
#!/bin/bash

set -euxo pipefail

# This script tries to delete leaked VPCs and CloudFormation stacks of the Inspektor Gadget CI. It
# doesn't check whether the resources are actually leaked, so running it while an integration test
# is in progress could break that test.

region="us-east-2"

###### delete VPCs and their dependencies ######
# get all the vpc ids with tag ig-ci=true
VPCS=$(aws ec2 --region ${region} describe-vpcs --filter "Name=tag:ig-ci,Values=true" | jq -r '.Vpcs[].VpcId')

for vpc in $VPCS
do
  echo "Deleting VPC $vpc"

  # detach and delete gateways
  igw=$(aws ec2 --region ${region} describe-internet-gateways --filters Name=attachment.vpc-id,Values=${vpc} | jq -r .InternetGateways[].InternetGatewayId)
  if [ -n "${igw}" ]; then
    for gw in ${igw}; do
      echo "Detaching internet gateway ${gw}"
      aws ec2 --region ${region} detach-internet-gateway --internet-gateway-id ${gw} --vpc-id ${vpc}
      echo "Deleting internet gateway ${gw}"
      aws ec2 --region ${region} delete-internet-gateway --internet-gateway-id ${gw}
    done
  fi

  # delete network interfaces
  subnets=$(aws ec2 --region ${region} describe-subnets --filters Name=vpc-id,Values=${vpc} | jq -r .Subnets[].SubnetId)
  if [ -n "${subnets}" ]; then
    for subnet in ${subnets}; do
      echo "Deleting network interfaces in subnet ${subnet}"

      # get network interfaces
      network_interfaces=$(aws ec2 --region ${region} describe-network-interfaces --filters Name=subnet-id,Values=${subnet} | jq -r .NetworkInterfaces[].NetworkInterfaceId)
      if [ -n "${network_interfaces}" ]; then
        for ni in ${network_interfaces}; do
          echo "Deleting network interface ${ni}"
          aws ec2 --region ${region} delete-network-interface --network-interface-id ${ni}
        done
      fi
    done
  fi

  # delete security groups
  security_groups=$(aws ec2 --region ${region} \
    describe-security-groups --filters Name=vpc-id,Values=${vpc} | jq -r .SecurityGroups[].GroupId)
  if [ -n "${security_groups}" ]; then
    for sg in ${security_groups}; do
      # get security group name
      sg_name=$(aws ec2 --region ${region} describe-security-groups --group-ids ${sg} | jq -r .SecurityGroups[].GroupName)
      if [ "${sg_name}" == "default" ]; then
        continue
      fi
      echo "Deleting security group ${sg}"
      aws ec2 --region ${region} delete-security-group --group-id ${sg}
    done
  fi

  # delete subnets
  subnets=$(aws ec2 --region ${region} describe-subnets --filters Name=vpc-id,Values=${vpc} | jq -r .Subnets[].SubnetId)
  if [ -n "${subnets}" ]; then
    for subnet in ${subnets}; do
      echo "Deleting subnet ${subnet}"
      aws ec2 --region ${region} delete-subnet --subnet-id ${subnet}
    done
  fi

  aws ec2 --region ${region} delete-vpc --vpc-id $vpc
done

###### delete CloudFormation stacks ######
stack_names=$(aws --region ${region} cloudformation describe-stacks --query 'Stacks[?Tags[?Key == `ig-ci`]]' | jq -r '.[].StackName')
for stack in ${stack_names}
do
  echo "Deleting stack ${stack}"
  aws --region ${region} cloudformation delete-stack --stack-name ${stack}
done
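A cheap way to rehearse a destructive script like the one above is to shadow the aws CLI with a shell function that only echoes the call. A minimal sketch of that dry-run pattern (the VPC IDs are made up):

```shell
#!/bin/bash
set -euo pipefail

# Shadow the real CLI: every aws invocation below just prints what it would do.
aws() { echo "DRY-RUN: aws $*"; }

VPCS="vpc-0aaa vpc-0bbb"   # stand-in for the describe-vpcs | jq output

for vpc in ${VPCS}; do
  aws ec2 delete-vpc --vpc-id "${vpc}"
done
# prints:
# DRY-RUN: aws ec2 delete-vpc --vpc-id vpc-0aaa
# DRY-RUN: aws ec2 delete-vpc --vpc-id vpc-0bbb
```

Sourcing a stub like this before the cleanup script runs lets you check the deletion order (interfaces, security groups, subnets, then the VPC) without touching the account.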

Rather than eksctl, should we use aws eks?

I tried that, but then you also need to create the VPCs, gateways, node groups and everything else, so in the end it's much more complicated than using eksctl.

mauriciovasquezbernal avatar Feb 23 '24 14:02 mauriciovasquezbernal

This looks painful:

Use --vpc-private-subnets & --vpc-public-subnets to avoid creating a new VPC for the cluster

Why don't we go for this option? I mean, we will have to create some resources (VPC, gateways, subnets), but that would be a one-time thing?

mqasimsarfraz avatar Feb 23 '24 14:02 mqasimsarfraz

Why don't we go for this option? I mean, we will have to create some resources (VPC, gateways, subnets), but that would be a one-time thing?

Besides me not wanting to go through all the pain of setting that up again:

  • I wanted to have clusters as independent as possible
  • It'll be a new dependency for running the CI, so forks will have to set it up in order to run integration tests

Perhaps I'm being too optimistic that it'll be fixed upstream soon and that we can handle some manual cleanup in the meantime.

mauriciovasquezbernal avatar Feb 23 '24 15:02 mauriciovasquezbernal

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.

github-actions[bot] avatar Apr 30 '24 01:04 github-actions[bot]

I think this is fixed by #2686. Would you mind confirming @burak-ok ?

mauriciovasquezbernal avatar Apr 30 '24 16:04 mauriciovasquezbernal

I think this is fixed by #2686. Would you mind confirming @burak-ok ?

Correct, this is fixed

burak-ok avatar May 02 '24 08:05 burak-ok