Perform S3 directory deletion with batch requests
Description
Speed up the deletion of an S3 "directory" (a path prefix which corresponds to multiple S3 objects) by using batch deletion requests.
The deletion acts under a given prefix, e.g. my_schema/my_table/. In case there is also an object with the bare key my_schema/my_table (no trailing slash), this PR deletes that object as well.
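For illustration, here is a minimal sketch of that behavior against the AWS SDK v1 API (deleteDirectory is a hypothetical helper for this write-up, not the PR's actual code):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.List;
import static java.util.stream.Collectors.toList;

// Hypothetical helper: delete every object under "<prefix>/" in batches of up
// to 1000 keys (one listing page), then remove the bare "<prefix>" key itself.
static void deleteDirectory(AmazonS3 s3, String bucket, String prefix)
{
    ListObjectsV2Request request = new ListObjectsV2Request()
            .withBucketName(bucket)
            .withPrefix(prefix + "/");
    ListObjectsV2Result result;
    do {
        result = s3.listObjectsV2(request);
        List<String> keys = result.getObjectSummaries().stream()
                .map(S3ObjectSummary::getKey)
                .collect(toList());
        if (!keys.isEmpty()) {
            // a single batch request replaces one DELETE call per object
            s3.deleteObjects(new DeleteObjectsRequest(bucket).withKeys(keys.toArray(new String[0])));
        }
        request.setContinuationToken(result.getNextContinuationToken());
    } while (result.isTruncated());
    // also delete a plain object stored at the prefix itself, if one exists
    s3.deleteObject(bucket, prefix);
}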
This PR partly duplicates the existing logic of TrinoS3FileSystem#getFileStatus(Path) and adapts it to the needs of io.trino.plugin.hive.s3.TrinoS3FileSystem#delete:
https://github.com/trinodb/trino/blob/31e7bd6259e3a7b9855d2698be2431a2d615fda6/plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java#L446-L476
Is this change a fix, improvement, new feature, refactoring, or other?
Fix
Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
Hive, Delta, Iceberg (Lakehouse connectors)
How would you describe this change to a non-technical end user or system administrator?
Speed up the deletion of an S3 "directory" (a path prefix which corresponds to multiple S3 objects) by using batch deletion requests.
Related issues, pull requests, and links
Fixes #13017
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:
# Delta, Hive, Iceberg
* Ensure that rename and delete work correctly for S3 directories that also exist as plain objects
I'm not sure this logic totally makes sense, because an S3 object with Content-Type: "application/octet-stream", with 0 bytes and a key that doesn't end in / (e.g. s3://bucket/key) seems like it should be interpreted as an "empty file" and not a directory. I could see the argument for interpreting an S3 object s3://bucket/key/ as a directory, or s3://bucket/key as a "directory" even if no object with that key exists, so long as a "child object" exists (e.g. some object with the key s3://bucket/key/object). But when an S3 object actually exists without the trailing slash, I think we have to interpret that as a "file" and not a directory, unless it's appropriately marked with the right content type.
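To make that distinction concrete, here is a sketch of the interpretation being argued for (isDirectoryMarker is a hypothetical helper, and the content type constant is an assumption rather than what TrinoS3FileSystem actually checks):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Hypothetical helper: only keys ending in "/" or zero-byte objects explicitly
// marked with a directory content type count as directory markers.
// "application/x-directory" is an assumption for this sketch.
static boolean isDirectoryMarker(AmazonS3 s3, String bucket, String key)
{
    if (key.endsWith("/")) {
        return true;
    }
    ObjectMetadata metadata = s3.getObjectMetadata(bucket, key);
    return metadata.getContentLength() == 0
            && "application/x-directory".equals(metadata.getContentType());
}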
Now with that said, it's important to remember that S3 is not a file system; it's an object store, so it doesn't really have directories. So there might be a better option for how to handle delete(Path path, boolean recursive) with recursive == true. In that case, instead of doing this, which recursively generates S3 listing, S3 getObjectMetadata, and S3 deleteObject calls:
for (FileStatus file : listStatus(path)) {
    delete(file.getPath(), true);
}
You could instead do something like (simplified to ignore details):
Iterator<S3ObjectSummary> listings = S3Objects.withPrefix(s3, bucketName, keyFromPath(path) + "/").iterator();
Iterator<String> keys = Iterators.transform(listings, S3ObjectSummary::getKey);
Iterator<List<String>> keyBatches = Iterators.partition(keys, 1000);
while (keyBatches.hasNext()) {
    String[] keysInBatch = keyBatches.next().toArray(String[]::new);
    // TODO: handle MultiObjectDeleteException in case some deletes fail
    s3.deleteObjects(new DeleteObjectsRequest(bucketName).withKeys(keysInBatch));
}
This will be much more efficient and faster, and will completely remove all S3 objects whose actual keys start with the prefix "s3://<bucket>/<path>/".
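On the TODO in the snippet above: a hedged sketch of handling MultiObjectDeleteException (deleteBatch is a hypothetical wrapper around the same call, and the error reporting is an assumption):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.MultiObjectDeleteException;
import com.amazonaws.services.s3.model.MultiObjectDeleteException.DeleteError;
import java.io.IOException;
import java.util.StringJoiner;

// Hypothetical wrapper: batch deletion is not atomic, so some keys may have
// been deleted while others failed; collect the failed keys and surface them.
static void deleteBatch(AmazonS3 s3, String bucketName, String[] keysInBatch)
        throws IOException
{
    try {
        s3.deleteObjects(new DeleteObjectsRequest(bucketName).withKeys(keysInBatch));
    }
    catch (MultiObjectDeleteException e) {
        StringJoiner failedKeys = new StringJoiner(", ");
        for (DeleteError error : e.getErrors()) {
            failedKeys.add(error.getKey() + " (" + error.getCode() + ")");
        }
        throw new IOException("Failed to delete keys: " + failedKeys, e);
    }
}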
@pettyjamesm I have addressed your comment.
As you mentioned offline, there is definitely room for improvement in io.trino.plugin.hive.s3.TrinoS3FileSystem#rename. I would suggest addressing the refactoring of that method in a separate PR.
I think we probably want to reorganize the logic to check recursive explicitly
I agree. The current state of the code for the delete method is rather hard to follow. I'll do a subsequent refactoring.
@pettyjamesm I've added a set of tests against AWS S3 object storage in order to be able to test the assumptions from the code.
We’re planning to replace TrinoS3FileSystem with a native TrinoFileSystem implementation, which has an API designed for object stores. Iceberg and Delta Lake already use it. Maybe instead of improving code that we’ll throw away, we should build the new implementation now?
@electrum I understand the direction of replacing TrinoS3FileSystem with a native TrinoFileSystem implementation.
However, my change is narrowly scoped (it only concerns the delete method), and the endeavour of building a native TrinoFileSystem implementation would take much more time.
Even if the bugfix is short- to medium-lived, I still see a benefit in landing this new approach.
I refactored the deletion algorithm to address the comments. @pettyjamesm please be so kind and take a fresh look over the PR.
After reviewing the implementation together with @pettyjamesm, we came up with a simplified approach for the delete(Path, boolean) operation.
The previous approach, which depended on analyzing object metadata, has been replaced with one based primarily on listing the objects and, for recursive deletes, performing the deletions in batches.
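A sketch of the simplified shape (the method and variable names are assumptions; the real code differs): a non-recursive delete lists at most one key under the prefix and fails if the "directory" is non-empty, while a recursive delete streams the listing into batch deletions as in the snippets above.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import java.io.IOException;

// Hypothetical outline of delete(Path, boolean): listing first, no metadata calls.
static void delete(AmazonS3 s3, String bucket, String prefix, boolean recursive)
        throws IOException
{
    if (!recursive) {
        // listing a single key is enough to detect a non-empty "directory"
        boolean nonEmpty = !s3.listObjectsV2(new ListObjectsV2Request()
                .withBucketName(bucket)
                .withPrefix(prefix + "/")
                .withMaxKeys(1))
                .getObjectSummaries()
                .isEmpty();
        if (nonEmpty) {
            throw new IOException("Directory " + prefix + " is not empty");
        }
        s3.deleteObject(bucket, prefix);
        return;
    }
    // recursive case: batch deletion under the prefix, as sketched earlier
    deleteDirectory(s3, bucket, prefix);
}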
Rebasing on top of master to make use of the changes from https://github.com/trinodb/trino/pull/14489
Updated the PR from master to use the latest MinIO image for testing, which offers increased S3 API compatibility.
the build seems red
@findinpath please rebase
Rebased to deal with conflicts from master.
Test PR with secrets: https://github.com/trinodb/trino/pull/15098 (@nineinchnick's https://github.com/trinodb/trino/pull/12817 will make these PRs easier)
Also, the first commit is separately being tested here: https://github.com/trinodb/trino/pull/15099
@findinpath can you please investigate the CI outcome?
https://github.com/trinodb/trino/actions/runs/3495735811/jobs/5852837911 showcases that the content can't be deleted on the presto-ci-test S3 bucket.
Error: io.trino.plugin.hive.s3.TestTrinoS3FileSystemAwsS3.testDeleteNonRecursivelyNonEmptyDeepPath Time elapsed: 0.403 s <<< FAILURE!
java.io.IOException: Failed to delete paths with the prefix path s3://***/test-illegal-delete-non-recursively-deep-path-non-empty-e69bb17a-8e62-49e2-b281-1a84037c5992
...
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: EHCHYPMFXNKTEW9C; S3 Extended Request ID: OfGRQM1L2Kv6gX2pQLjTiLWxdmEJCUkHa5nkpa5qrtyXSaqZVlpb9X6R8TgIYP+h8Tm8gW3fFf4=; Proxy: null), S3 Extended Request ID: OfGRQM1L2Kv6gX2pQLjTiLWxdmEJCUkHa5nkpa5qrtyXSaqZVlpb9X6R8TgIYP+h8Tm8gW3fFf4=
CI hit https://github.com/trinodb/trino/issues/13199 2 times.
Rebased on top of master in order to cope with the fact that the hive AWS tests now use different secrets, which is why I needed to change the S3 bucket used for testing to trino-ci-test correspondingly.
The only file changed after the rebase was ci.yml
@findepi / @ebyhr please update the clone PR.
Updated the mirror PR #15098
CI hit https://github.com/trinodb/trino/issues/13199 for ci / pt (default, suite-delta-lake-databricks113).
ci / check-commits (pull_request) Failing after 37s
Rebasing on top of master to address the check-commits issue.
/test-with-secrets sha=12265ea06d6ac9e006bb56973cb23cc54a8e9d2d
(rebased to fix test-with-secrets job)
/test-with-secrets sha=07e689f46a2a40fe6e618b02a0eae4a2ff492b76
Run with secrets: https://github.com/trinodb/trino/actions/runs/3591234020
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3591234020
@findinpath what were the changes in the last two force-pushes (https://github.com/trinodb/trino/pull/13974#event-7930164952, https://github.com/trinodb/trino/pull/13974#event-7932782205)?
I have changed the setup() and tearDown() methods of TestTrinoS3FileSystemMinio so that no resources are allocated in the constructor.
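For context, a sketch of that pattern (using a generic MinIO test container as a stand-in for the real test setup): the container is started in setup() rather than the constructor, so a failed allocation cannot leave the test class half-initialized, and tearDown() releases it even when tests fail.

import org.testcontainers.containers.GenericContainer;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;

// Illustrative only; the real TestTrinoS3FileSystemMinio differs.
public class TestTrinoS3FileSystemMinio
{
    private GenericContainer<?> minio; // intentionally not allocated in the constructor

    @BeforeClass
    public void setup()
    {
        minio = new GenericContainer<>("minio/minio")
                .withCommand("server", "/data")
                .withExposedPorts(9000);
        minio.start();
    }

    @AfterClass(alwaysRun = true)
    public void tearDown()
    {
        if (minio != null) {
            minio.stop();
            minio = null;
        }
    }
}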
/test-with-secrets sha=ba2d0b67d0890174483733fe48012373c7478ead
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3623643550
/test-with-secrets sha=62a44129acd1d4db0e8630062e3c01901bac0882