
`aws s3 rm` should use batch deletes

Open bchecketts opened this issue 8 years ago • 8 comments

When running the command `aws s3 rm --recursive s3://bucketname/path/`, I expect it to use batch object deletion to delete the files quickly with the fewest requests. Instead, it appears to be deleting files one at a time.

  $ aws --version
  aws-cli/1.11.13 Python/3.5.2 Linux/4.4.0-1041-aws botocore/1.4.70

To test: create one bucket with several files, then sync it to a second bucket (the one we will delete everything from):

aws s3 sync s3://source-bucket/ s3://bucket-to-delete/

Then delete the contents of bucket-to-delete: aws s3 rm --recursive s3://bucket-to-delete/

Notice that it lists each file to delete sequentially.

Re-sync from source-bucket to bucket-to-delete, then re-delete with --debug to capture the details, saving them to /tmp/out:

aws s3 rm --debug --recursive s3://bucket-to-delete/ 2>&1 | tee /tmp/out

Then inspect /tmp/out for HTTP requests to confirm that a separate DELETE request was issued for each object:

$ grep HTTP /tmp/out
2018-02-22 20:22:51,694 - MainThread - botocore.auth - DEBUG - HTTP request method: GET
2018-02-22 20:22:52,065 - Thread-4 - botocore.auth - DEBUG - HTTP request method: DELETE
2018-02-22 20:22:52,227 - Thread-6 - botocore.auth - DEBUG - HTTP request method: DELETE
2018-02-22 20:22:52,248 - Thread-8 - botocore.auth - DEBUG - HTTP request method: DELETE
2018-02-22 20:22:52,329 - Thread-9 - botocore.auth - DEBUG - HTTP request method: DELETE
2018-02-22 20:22:52,390 - Thread-4 - botocore.auth - DEBUG - HTTP request method: DELETE
2018-02-22 20:22:52,410 - Thread-11 - botocore.auth - DEBUG - HTTP request method: DELETE
2018-02-22 20:22:52,553 - Thread-4 - botocore.auth - DEBUG - HTTP request method: DELETE

Deleting portions of large buckets could be much faster, with far less network overhead, if the CLI used batch deletes via the Multi-Object Delete API: https://docs.aws.amazon.com/AmazonS3/latest/API/multiobjectdeleteapi.html
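For reference, a rough sketch of what batched deletion could look like with boto3 (the bucket name and key list are placeholders; DeleteObjects accepts up to 1,000 keys per request, so the interesting part is the batching):

```python
def chunk(keys, size=1000):
    """Split a list of keys into DeleteObjects-sized batches (max 1,000 each)."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]


def batch_delete(bucket, keys, s3=None):
    """Delete `keys` from `bucket` using one DeleteObjects call per 1,000 keys.

    Illustrative only: running this requires AWS credentials and a real bucket.
    """
    import boto3  # imported lazily so the batching logic is usable standalone
    s3 = s3 or boto3.client("s3")
    for batch in chunk(keys):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
        )
```

With this approach, deleting 10,000 objects costs 10 requests instead of 10,000.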

bchecketts avatar Feb 22 '18 20:02 bchecketts

Yeah that would definitely be faster. Marking as an enhancement. Thanks for bringing it up!

JordonPhillips avatar Feb 22 '18 23:02 JordonPhillips

Confirmed this is still the case with awscli v1.14.45 (the previous output was from 1.11.13).

 ~/.local/bin/aws --version
aws-cli/1.14.45 Python/2.7.12 Linux/4.4.0-1041-aws botocore/1.8.49

grep HTTP /tmp/out.new
2018-02-23 01:41:55,982 - MainThread - botocore.vendored.requests.packages.urllib3.connectionpool - INFO - Starting new HTTPS connection (1): s3.amazonaws.com
2018-02-23 01:41:56,155 - MainThread - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "GET /deleteme-20180222a?prefix=&encoding-type=url HTTP/1.1" 200 None
2018-02-23 01:41:56,243 - ThreadPoolExecutor-0_0 - botocore.vendored.requests.packages.urllib3.connectionpool - INFO - Starting new HTTPS connection (1): s3.amazonaws.com
2018-02-23 01:41:56,332 - ThreadPoolExecutor-0_1 - botocore.vendored.requests.packages.urllib3.connectionpool - INFO - Starting new HTTPS connection (2): s3.amazonaws.com
2018-02-23 01:41:56,333 - ThreadPoolExecutor-0_0 - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "DELETE /deleteme-20180222a/1.pdf HTTP/1.1" 204 0
2018-02-23 01:41:56,497 - ThreadPoolExecutor-0_2 - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "DELETE /deleteme-20180222a/4.pdf HTTP/1.1" 204 0
2018-02-23 01:41:56,517 - ThreadPoolExecutor-0_0 - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "DELETE /deleteme-20180222a/7.pdf HTTP/1.1" 204 0
2018-02-23 01:41:56,557 - ThreadPoolExecutor-0_1 - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "DELETE /deleteme-20180222a/2.pdf HTTP/1.1" 204 0
2018-02-23 01:41:56,559 - ThreadPoolExecutor-0_3 - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "DELETE /deleteme-20180222a/5.pdf HTTP/1.1" 204 0
2018-02-23 01:41:56,618 - ThreadPoolExecutor-0_4 - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "DELETE /deleteme-20180222a/6.pdf HTTP/1.1" 204 0
2018-02-23 01:41:56,639 - ThreadPoolExecutor-0_2 - botocore.vendored.requests.packages.urllib3.connectionpool - DEBUG - "DELETE /deleteme-20180222a/3.pdf HTTP/1.1" 204 0

bchecketts avatar Feb 23 '18 01:02 bchecketts

Is there any update on this? We suffer every time we have to delete a bucket with millions of files.

ariasjose avatar Jul 28 '22 16:07 ariasjose

still suffering.

koziez avatar Oct 12 '22 13:10 koziez

+1

I also had to use 'nice' and 'cpulimit' to prevent the EC2 instance I was running it on from overloading.

sudo apt install cpulimit
/usr/bin/cpulimit -q -b -c 1 -e aws -l 30
nice aws s3 rm --quiet s3://my.bucket.name --recursive

I had to resort to using the CLI because the web browser tab crashed while I was attempting the same 'Empty Bucket' command overnight :(

Does anyone have a better way to empty a bucket?

Paully

plittlefield avatar Apr 14 '23 11:04 plittlefield

The best way so far for me has been a Glue job using purge_s3_path: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-purge_s3_path

ariasjose avatar Apr 14 '23 15:04 ariasjose

I have solved it by creating a Lifecycle Configuration in the bucket to delete all objects and markers after 1 day.
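For anyone taking this route, a sketch of such a rule set via the CLI (the bucket name and rule ID are placeholders; lifecycle expiration runs asynchronously, so objects disappear within a day or so rather than immediately):

```shell
# Expire all current objects after 1 day, expire noncurrent versions, and
# abort stale multipart uploads. Bucket name and rule ID are placeholders.
aws s3api put-bucket-lifecycle-configuration \
  --bucket my-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "empty-bucket",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 1},
      "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1}
    }]
  }'
```

Remember to remove the rule afterwards if the bucket will be reused.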

plittlefield avatar Apr 15 '23 10:04 plittlefield


I typically use the bash script below:

function s3_batch_delete(){
    # This function deletes files from an S3 bucket based on a specified prefix.
    # Arguments:
    #   $1: The bucket name from which files will be deleted.
    #   $2: The prefix for the files to be deleted. The prefix should not start with a '/'.
    #   $3: The AWS profile to use for authentication. The profile is stored in the ~/.aws/credentials file.

    # List the keys we want to delete and save them in the keysToDelete.txt file.
    aws s3api list-objects-v2 --output text --bucket "${1}" --prefix "${2}" --query 'Contents[].[Key]' --profile "${3}" > keysToDelete.txt

    # Delete the keys in keysToDelete.txt in batches of 1000.
    # We use -P$(nproc) to run multiple batches in parallel over all the available logical cores.
    # To handle longer paths, we adjust the --max-chars setting of xargs to 90% of the maximum argument size allowed by the platform.
    max_arg=$(echo $(getconf ARG_MAX)*0.90/1 | bc)
    cat keysToDelete.txt | xargs -P$(nproc) -n1000 --max-chars="$max_arg" bash -c 'aws s3api delete-objects --bucket '"${1}"' --profile '"${3}"' --delete "Objects=[$(printf "{Key=%q}," "$@")]" >> deletedKeysAndVersionOfDeleteMarker.txt' _ 
    cat deletedKeysAndVersionOfDeleteMarker.txt && rm deletedKeysAndVersionOfDeleteMarker.txt

    # Remove the file with keys after the files have been wiped from S3.
    rm keysToDelete.txt
}

It can be used as follows, and issues multiple batch deletes, each removing up to 1,000 objects in a single API call, in parallel across all available logical cores: s3_batch_delete bucketName prefix profileName

BrendBraeckmans avatar Jan 23 '24 19:01 BrendBraeckmans