aws-cli

s3 sync: files with excluded prefixes checked client side

Open timmeinerzhagen opened this issue 2 years ago • 1 comment

Describe the bug

We sync local files to the root of an S3 bucket. That bucket contains an existing folder, which we --exclude for this specific sync. The folder holds over 1,800,000 objects.

The sync takes about 16 minutes, during which it does nothing but list every object, including those in the excluded folder, and check each one against the exclude rules.

Expected Behavior

The --exclude option should cause the excluded paths to be left out of the call that lists the S3 objects.

This way, the exclude check would be performed server-side rather than client-side.

Current Behavior

The sync lists ALL objects in the entire bucket and only then applies the exclude rules client-side.
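To make the cost concrete, here is a minimal Python sketch of client-side exclusion. It is not the actual aws-cli code: `filter_client_side` and the sample keys are hypothetical, and glob-style matching via `fnmatch` is assumed to approximate how `--exclude` patterns are evaluated. The point is that every listed key has to be examined, even when almost all of them are discarded.

```python
from fnmatch import fnmatch


def filter_client_side(all_keys, exclude_patterns):
    """Return (kept_keys, examined_count) after applying exclude patterns.

    Hypothetical helper: every key must already have been listed and
    transferred to the client before it can be checked and discarded.
    """
    kept = []
    examined = 0
    for key in all_keys:
        examined += 1
        # A key is dropped if it matches any exclude pattern.
        if not any(fnmatch(key, pattern) for pattern in exclude_patterns):
            kept.append(key)
    return kept, examined


# Two relevant keys plus 1000 keys under the excluded "test/" folder.
keys = ["index.html", "app.js"] + [f"test/file-{i}.bin" for i in range(1000)]
kept, examined = filter_client_side(keys, ["test/*"])
print(f"kept {len(kept)} keys after examining {examined}")
```

The work grows with the total number of objects in the bucket, not with the number of objects that are actually relevant to the sync.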

Reproduction Steps

  1. Create bucket test-bucket (choose a distinct name)
  2. Add a folder test containing many random files (e.g. millions)
  3. Run sync to the root of that bucket, excluding the test folder
    aws s3 sync --delete --debug . s3://test-bucket --exclude "test/*"
    

Possible Solution

The S3 ListObjectsV2 endpoint that is used does not seem to expose filter options.
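One possible direction, shown here only as a sketch, is to list hierarchically with the `Prefix` and `Delimiter` parameters that ListObjectsV2 does accept, and never descend into an excluded folder. The code below runs against an in-memory stand-in rather than S3: `fake_list_objects_v2` and `walk_excluding` are hypothetical names, pagination is ignored, and the approach only covers excludes that amount to a plain folder prefix such as "test/*", not arbitrary glob patterns.

```python
def fake_list_objects_v2(keys, prefix="", delimiter="/"):
    """In-memory stand-in for one ListObjectsV2 call (no pagination).

    Returns (contents, common_prefixes): keys directly under `prefix`,
    and each sub-"folder" rolled up into a single common prefix.
    """
    contents = []
    common_prefixes = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything below a sub-folder collapses into one entry.
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            contents.append(key)
    return contents, sorted(common_prefixes)


def walk_excluding(keys, excluded_prefixes, prefix=""):
    """List keys level by level, skipping excluded folders entirely."""
    contents, common_prefixes = fake_list_objects_v2(keys, prefix)
    found = list(contents)
    for sub_prefix in common_prefixes:
        if sub_prefix in excluded_prefixes:
            continue  # never list the contents of an excluded folder
        found.extend(walk_excluding(keys, excluded_prefixes, sub_prefix))
    return found


bucket = ["index.html", "assets/app.js", "assets/img/logo.png"]
bucket += [f"test/file-{i}.bin" for i in range(1000)]
listed = walk_excluding(bucket, {"test/"})
print(listed)
```

With a real bucket, the excluded folder would appear as a single common prefix in the root listing, so its objects would never be enumerated. The trade-off is one list request per folder level visited, which could be slower for deep hierarchies with few objects per folder.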

Additional Information/Context

No response

CLI version used

2.9.19

Environment details (OS name and version, etc.)

Ubuntu 22.04.1 (on GitHub Actions)

timmeinerzhagen avatar Feb 09 '23 03:02 timmeinerzhagen

These are some really good ideas. Thank you for writing this up, @michaeltremeer. We'll talk internally about this and update you accordingly. cc @silvanocerza

vblagoje avatar Sep 04 '24 09:09 vblagoje

We're working on a pipeline checkpointing feature that will allow setting breakpoints at every component of a pipeline: https://github.com/deepset-ai/haystack/issues/8972

julian-risch avatar Apr 04 '25 13:04 julian-risch