s3 sync: files with excluded prefixes checked client side
Describe the bug
We sync files from a local directory to the root of an S3 bucket. That bucket also contains an existing folder which we --exclude for this specific sync; the folder holds over 1,800,000 objects.
The sync takes about 16 minutes, during which it does nothing but list every object in the bucket, including those in the excluded folder, and check each one against the exclude rules.
Expected Behavior
The --exclude option should cause the excluded paths to be filtered out in the call that lists the S3 objects.
That way, the exclude check would be performed server-side rather than client-side.
Current Behavior
The sync lists ALL objects in the entire bucket and only then applies the exclude rules client-side.
Reproduction Steps
- Create a bucket `test-bucket` (choose a distinct name)
- Add a folder `test` with many random files (e.g. millions)
- Run a sync to the root of that bucket, excluding the `test` folder (a scripted version of these steps is sketched below): `aws s3 sync --delete --debug . s3://test-bucket --exclude "test/*"`
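For convenience, here is a rough shell sketch of the steps above. The bucket name, file count, and local directory layout are illustrative assumptions (the original report has well over a million objects under the excluded folder), so scale the loop up before drawing timing conclusions.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative values only: bucket name and file count are assumptions.
BUCKET="test-bucket-$RANDOM"
aws s3 mb "s3://$BUCKET"

# Populate the prefix that will later be excluded. Scaled down to 10,000
# files here; the reported bucket has over 1,800,000 objects under test/.
mkdir -p test
for i in $(seq 1 10000); do
  echo "$i" > "test/file-$i.txt"
done
aws s3 cp test "s3://$BUCKET/test" --recursive --quiet

# Sync an unrelated local directory to the bucket root, excluding test/*.
# Despite the exclude, the --debug output shows every key under test/
# being returned by the listing call and filtered out client-side.
mkdir -p payload
echo "hello" > payload/hello.txt
cd payload
time aws s3 sync --delete --debug . "s3://$BUCKET" --exclude "test/*"
```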
Possible Solution
The S3 ListObjectsV2 API that is used does not seem to expose exclude-style filters; it only accepts a Prefix (and Delimiter) per request, so exclude patterns cannot be pushed down to the server directly. A possible workaround is sketched below.
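As a rough illustration of what a more server-side-aware listing could look like (a sketch only, not how `aws s3 sync` works today): split the listing by top-level common prefixes using the Delimiter parameter, and simply never paginate into prefixes that match an exclude rule. The bucket name and excluded prefix below are assumptions.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assumptions for illustration only.
BUCKET="test-bucket"
EXCLUDED_PREFIX="test/"

# List only the top-level "folders" (CommonPrefixes) of the bucket.
prefixes=$(aws s3api list-objects-v2 --bucket "$BUCKET" --delimiter "/" \
  --query 'CommonPrefixes[].Prefix' --output text)

# Paginate into every prefix except the excluded one, so the ~1.8M keys
# under test/ are never transferred over the wire at all.
for p in $prefixes; do
  if [ "$p" = "$EXCLUDED_PREFIX" ]; then
    continue
  fi
  aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$p" \
    --query 'Contents[].Key' --output text
done

# Root-level keys appear in the Contents element of the delimited call
# above; listing them is omitted here for brevity.
```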
Additional Information/Context
No response
CLI version used
2.9.19
Environment details (OS name and version, etc.)
Ubuntu 22.04.1 (on GitHub Actions)