
[BUG] AWS nuke consumes serious amount of memory

Open rbalman opened this issue 4 years ago • 13 comments

Scenario

When I was running aws-nuke in my AWS CodeBuild project, the nuking process was repeatedly killed by CodeBuild during the data-gathering phase, where aws-nuke scans the list of resources to be deleted/filtered. I then ran aws-nuke on my local machine to find the cause, and what I found was surprising: aws-nuke was consuming a lot of memory (5GB resident and 9.7GB virtual) ☠️, and memory consumption increased whenever aws-nuke paused to gather metadata for a resource.

Problem

Because of this issue, I am not able to run the aws-nuke process on a 3GB CodeBuild instance. I have no choice but to use a 7GB or 15GB CodeBuild instance.

Screenshots

CodeBuild killing the aws-nuke process (screenshot, Sep 30 2019)

aws-nuke consuming a serious amount of memory (screenshot, Sep 30 2019)

rbalman avatar Sep 30 '19 11:09 rbalman

We used to run it on Lambda with as little as 128MB of memory, but with the latest version it only runs successfully with 3008MB (the Lambda maximum). With 1024MB it was still crashing with an OOM error. I guess it depends on the number of resources it collects in the account, but 3GB still seems like a lot.

mseiwald avatar Dec 12 '19 07:12 mseiwald

My guess is that this happens because we started storing the whole AWS response in order to make the code more readable when using Properties().

We need to check whether that is really the case.

svenwltr avatar Dec 16 '19 09:12 svenwltr

The amount of memory this tool is consuming is ridiculous. I ran this on an AWS account that is in heavy use by several developers as a sandbox and got an OOM on an m5.2xlarge EC2 instance with 32GB of RAM. Any chance of limiting memory usage?


pawelros avatar May 24 '20 20:05 pawelros

To fix this we would have to go through all resources and reduce the stored data to a minimum, i.e. store only the required fields on each resource rather than the whole AWS response. aws-nuke would still have to keep all resources in memory, because it prints all resources first and deletes them afterwards, but this could drastically reduce memory consumption.
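A minimal Go sketch of that fix, with hypothetical `fullResponse`, `instanceHeavy`, and `instanceLight` types standing in for an SDK response and an aws-nuke resource (this is illustrative, not the project's actual code):

```go
package main

import "fmt"

// fullResponse stands in for a whole AWS SDK Describe* output; the real
// structs carry many more fields than a resource ever needs.
type fullResponse struct {
	InstanceID string
	State      string
	LaunchTime string
	// ...dozens of other fields kept alive as long as the resource is stored
}

// Heavy pattern: the resource keeps the entire response alive, so none of
// the unused fields can be garbage-collected until the resource is dropped.
type instanceHeavy struct {
	raw *fullResponse
}

// Light pattern: copy out only what Remove/Properties/Filter/String need,
// letting the rest of the response be collected right after the scan.
type instanceLight struct {
	id    string
	state string
}

func newInstanceLight(r *fullResponse) instanceLight {
	return instanceLight{id: r.InstanceID, state: r.State}
}

func (i instanceLight) String() string { return i.id }

func main() {
	r := &fullResponse{InstanceID: "i-0abc", State: "running", LaunchTime: "..."}
	res := newInstanceLight(r)
	// res holds two small strings; r is no longer referenced and can be freed
	fmt.Println(res)
}
```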

svenwltr avatar May 25 '20 14:05 svenwltr

An option to save AWS responses into file chunks and process them one by one rather than all at once could solve the problem.

pawelros avatar May 26 '20 09:05 pawelros

@svenwltr I did some investigating and ran aws-nuke in a loop against each available resource type separately. The S3Object target by itself ran out of memory after consuming more than 30GB of RAM. This makes me suspect that you load full S3 objects, with their entire binary content, into memory. Is that true? If so, fixing it could IMO drastically reduce memory consumption.

On a side note, running aws-nuke against all targets took 17 minutes before it crashed due to OOM. When I excluded S3Object in the config, the entire run took only 17 seconds, 60 times faster. So IMO there is seriously something wrong with that target.

pawelros avatar May 26 '20 11:05 pawelros

Hello @pawelros.

An option to save AWS responses into file chunks and process them one by one rather than all at once could solve the problem.

This is true, but it would be quite a big change.

This makes me suspect that you load into memory full S3 objects with their entire binary content. Is that true?

No, it is not. The S3Object resource uses the ListObjectVersions API, which does not return the actual object contents.

The S3Object resource also does not store more data than necessary.

When I excluded S3Object in the config, the entire run took only 17 seconds, 60 times faster. So IMO there is seriously something wrong with that target.

The thing with S3 objects is that there can be very, very many of them. This is a common problem with this resource, and I do not see an easy way to solve it. If you do not need to delete specific S3 objects, I suggest excluding S3Object from scanning. It is still possible to delete S3 buckets without going through all of their objects.
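For anyone looking for the concrete setting, the exclusion can be expressed in the config file; a sketch assuming the usual `resource-types` section of an aws-nuke config:

```yaml
# nuke-config.yaml (fragment) - skip scanning individual S3 objects;
# buckets can still be deleted wholesale
resource-types:
  excludes:
    - S3Object
```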

svenwltr avatar May 26 '20 11:05 svenwltr

I have configured aws-nuke on a t3.large AWS VM and am getting the same memory issue: fatal error: runtime: out of memory. I then tried listing only the S3 bucket and Lambda services under targets in the nuke-config.yaml file, and it worked fine. I guess the OOM happens while describing every service. Is there any fix for this?

Johnsonkdavid avatar Sep 16 '20 06:09 Johnsonkdavid

aws-nuke used to crash with "fatal error: runtime: out of memory". It turned out there were far too many S3 objects (2M+); the issue was solved with aws-nuke -e S3Bucket,S3Object

danielrankov-mm avatar Oct 08 '20 21:10 danielrankov-mm

@danielrankov-mm - yes, it will describe each and every S3 object. I excluded them in the aws-nuke.conf file and it worked fine.

Johnsonkdavid avatar Oct 09 '20 08:10 Johnsonkdavid

I've just run quay.io/rebuy/aws-nuke:latest against a single region of an AWS account. The tool reported that it would remove 589 resources, and in preparation to do so it consumed more than 26GB of RAM. How is this tool still useful for anyone with more than a handful of resources to delete? I used to run this against an entire account, all regions, without issue (circa 2019). Now it cannot even delete all of the resources in a single region!? Is anyone working to fix this?

What is the value of storing the whole response from the List*/Describe* calls if all that is ever used is the InstanceID (or similar)? Am I using the tool improperly? Rather than holding the responses in memory, could they be written to a key-value database like Badger? Or could just the key attributes of a resource be held (ARN / tags, for example)? Happy to make a pull request and join others working to fix this.

jpbarto avatar Nov 08 '21 00:11 jpbarto

Is anyone working to fix this?

I do not think anyone is currently working on this.

What is the value of storing the whole response from the List*/Describe* calls if all that is ever used is the InstanceID (or similar)?

There is no value in doing this. It was simply the easiest way to implement it, and we did not experience problems with this approach in the beginning.

How is this tool still useful for anyone with more than a handful of resource to be deleted?

It appears this depends on the kinds of resources in the account, since every resource is programmed separately. It may be that a few resources store the whole Describe* output and are either not used by everyone or are present in notably large numbers for some users.

Rather than hold the responses in memory could they be written to a key value database like Badger?

I do not think that is necessary. At least we should not go that way before we have properly cleaned up all the resources.


So the action plan would be to go through all ~368 resources and store only the values in each struct that are needed by the Remove, Properties, Filter and String methods.

  • good example: https://github.com/rebuy-de/aws-nuke/blob/b5ccc0056f070379678264ccae7c88ddf8b5dfa5/resources/ec2-vpc-peering-connections.go#L10-L14
  • bad example: https://github.com/rebuy-de/aws-nuke/blob/b5ccc0056f070379678264ccae7c88ddf8b5dfa5/resources/ec2-tgw.go#L11-L14

We do not have time to do this.

svenwltr avatar Nov 09 '21 12:11 svenwltr

In case this helps anyone, I maxed out an m5.4xlarge EC2 instance even with S3Object excluded.

After running with the --verbose flag, I found that a CloudTrail with a multi-region trail was the culprit.

After excluding CloudTrail, I can run it on my laptop.

ericpardee avatar May 17 '24 22:05 ericpardee