Perform S3 directory deletion with batch requests
Description
Speed up the deletion of an S3 "directory" (a path prefix which corresponds to multiple S3 objects) by using batch deletion requests.
The deletion acts under a given prefix, e.g. my_schema/my_table/. In case there is also an object with the bare key my_schema/my_table (no trailing slash), this PR deletes that object as well.
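For illustration, here is a minimal sketch of that behavior against the AWS SDK v1 API (deleteDirectory is a hypothetical helper for this write-up, not the PR's actual code):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import java.util.List;
import static java.util.stream.Collectors.toList;

// Hypothetical helper: delete every object under "<prefix>/" in batches of up
// to 1000 keys (one listing page), then remove the bare "<prefix>" key itself.
static void deleteDirectory(AmazonS3 s3, String bucket, String prefix)
{
    ListObjectsV2Request request = new ListObjectsV2Request()
            .withBucketName(bucket)
            .withPrefix(prefix + "/");
    ListObjectsV2Result result;
    do {
        result = s3.listObjectsV2(request);
        List<String> keys = result.getObjectSummaries().stream()
                .map(S3ObjectSummary::getKey)
                .collect(toList());
        if (!keys.isEmpty()) {
            // a single batch request replaces one DELETE call per object
            s3.deleteObjects(new DeleteObjectsRequest(bucket).withKeys(keys.toArray(new String[0])));
        }
        request.setContinuationToken(result.getNextContinuationToken());
    } while (result.isTruncated());
    // also delete a plain object stored at the prefix itself, if one exists
    s3.deleteObject(bucket, prefix);
}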
This PR partly duplicates the existing logic of TrinoS3FileSystem#getFileStatus(Path) and adapts it to the needs of io.trino.plugin.hive.s3.TrinoS3FileSystem#delete:
https://github.com/trinodb/trino/blob/31e7bd6259e3a7b9855d2698be2431a2d615fda6/plugin/trino-hive/src/main/java/io/trino/plugin/hive/s3/TrinoS3FileSystem.java#L446-L476
Is this change a fix, improvement, new feature, refactoring, or other?
Fix
Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)
Hive, Delta, Iceberg (Lakehouse connectors)
How would you describe this change to a non-technical end user or system administrator?
Speed up the deletion of an S3 "directory" (a path prefix which corresponds to multiple S3 objects) by using batch deletion requests.
Related issues, pull requests, and links
Fixes #13017
Documentation
(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:
# Delta, Hive, Iceberg
* Ensure that rename and delete work correctly for S3 directories that also exist as plain objects
I'm not sure this logic totally makes sense, because an S3 object with Content-Type: "application/octet-stream", with 0 bytes and a key that doesn't end in / (e.g. s3://bucket/key) seems like it should be interpreted as an "empty file" and not a directory. I could see the argument for interpreting an S3 object s3://bucket/key/ as a directory, or s3://bucket/key as a "directory" even if no object with that key exists, so long as a "child object" exists (e.g. some object with the key s3://bucket/key/object). But when an S3 object actually exists without the trailing slash, I think we have to interpret that as a "file" and not a directory, unless it's appropriately marked with the right content type.
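To make that distinction concrete, here is a sketch of the interpretation being argued for (isDirectoryMarker is a hypothetical helper, and the content type constant is an assumption rather than what TrinoS3FileSystem actually checks):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;

// Hypothetical helper: only keys ending in "/" or zero-byte objects explicitly
// marked with a directory content type count as directory markers.
// "application/x-directory" is an assumption for this sketch.
static boolean isDirectoryMarker(AmazonS3 s3, String bucket, String key)
{
    if (key.endsWith("/")) {
        return true;
    }
    ObjectMetadata metadata = s3.getObjectMetadata(bucket, key);
    return metadata.getContentLength() == 0
            && "application/x-directory".equals(metadata.getContentType());
}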
Now with that said, it's important to remember that S3 is not a file system; it's an object store, so it doesn't really have directories. So there might be a better option for how to handle delete(Path path, boolean recursive) with recursive == true. In that case, instead of doing this, which recursively generates S3 listing, S3 getObjectMetadata, and S3 deleteObject calls:
for (FileStatus file : listStatus(path)) {
    delete(file.getPath(), true);
}
You could instead do something like (simplified to ignore details):
Iterator<S3ObjectSummary> listings = S3Objects.withPrefix(s3, bucketName, keyFromPath(path) + "/").iterator();
Iterator<String> keys = Iterators.transform(listings, S3ObjectSummary::getKey);
Iterator<List<String>> keyBatches = Iterators.partition(keys, 1000);
while (keyBatches.hasNext()) {
    String[] keysInBatch = keyBatches.next().toArray(String[]::new);
    // TODO: handle MultiObjectDeleteException in case some deletes fail
    s3.deleteObjects(new DeleteObjectsRequest(bucketName).withKeys(keysInBatch));
}
This will be much more efficient and faster, and will completely remove all S3 objects whose actual keys start with the prefix "s3://<bucket>/<path>/".
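On the TODO in the snippet above: a hedged sketch of handling MultiObjectDeleteException (deleteBatch is a hypothetical wrapper around the same call, and the error reporting is an assumption):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.DeleteObjectsRequest;
import com.amazonaws.services.s3.model.MultiObjectDeleteException;
import com.amazonaws.services.s3.model.MultiObjectDeleteException.DeleteError;
import java.io.IOException;
import java.util.StringJoiner;

// Hypothetical wrapper: batch deletion is not atomic, so some keys may have
// been deleted while others failed; collect the failed keys and surface them.
static void deleteBatch(AmazonS3 s3, String bucketName, String[] keysInBatch)
        throws IOException
{
    try {
        s3.deleteObjects(new DeleteObjectsRequest(bucketName).withKeys(keysInBatch));
    }
    catch (MultiObjectDeleteException e) {
        StringJoiner failedKeys = new StringJoiner(", ");
        for (DeleteError error : e.getErrors()) {
            failedKeys.add(error.getKey() + " (" + error.getCode() + ")");
        }
        throw new IOException("Failed to delete keys: " + failedKeys, e);
    }
}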
@pettyjamesm I have addressed your comment.
As you mentioned offline, there is definitely room for improvement in io.trino.plugin.hive.s3.TrinoS3FileSystem#rename. I would suggest addressing the refactoring of that method in a separate PR.
I think we probably want to reorganize the logic to check recursive explicitly
I agree. The current state of the code for the delete method is rather hard to follow. I'll do a subsequent refactoring.
@pettyjamesm I've added a set of tests against AWS S3 object storage in order to be able to test the assumptions from the code.
We’re planning to replace TrinoS3FileSystem with a native TrinoFileSystem implementation, which has an API designed for object stores. Iceberg and Delta Lake already use it. Maybe instead of improving code that we’ll throw away, we should build the new implementation now?
@electrum I understand the direction of replacing TrinoS3FileSystem with a native TrinoFileSystem implementation.
However, my change is narrowly scoped (it only concerns the delete method), and the endeavour of building a native TrinoFileSystem implementation would take much more time.
Even if the bugfix is short- to medium-lived, I still see a benefit in landing this new approach.
I refactored the deletion algorithm to address the comments. @pettyjamesm please be so kind and take a fresh look over the PR.
After reviewing the implementation together with @pettyjamesm, we came up with a simplified approach for the delete(Path, boolean) operation.
The previous approach, which depended on analyzing object metadata, has been replaced with one based primarily on listing the objects and, for recursive deletes, performing the deletions in batches.
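A sketch of the simplified shape (the method and variable names are assumptions; the real code differs): a non-recursive delete lists at most one key under the prefix and fails if the "directory" is non-empty, while a recursive delete streams the listing into batch deletions as in the snippets above.

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import java.io.IOException;

// Hypothetical outline of delete(Path, boolean): listing first, no metadata calls.
static void delete(AmazonS3 s3, String bucket, String prefix, boolean recursive)
        throws IOException
{
    if (!recursive) {
        // listing a single key is enough to detect a non-empty "directory"
        boolean nonEmpty = !s3.listObjectsV2(new ListObjectsV2Request()
                .withBucketName(bucket)
                .withPrefix(prefix + "/")
                .withMaxKeys(1))
                .getObjectSummaries()
                .isEmpty();
        if (nonEmpty) {
            throw new IOException("Directory " + prefix + " is not empty");
        }
        s3.deleteObject(bucket, prefix);
        return;
    }
    // recursive case: batch deletion under the prefix, as sketched earlier
    deleteDirectory(s3, bucket, prefix);
}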
Rebasing on top of master to make use of the changes from https://github.com/trinodb/trino/pull/14489
Updated the PR from master to use the latest MinIO image for testing, which offers increased S3 API compatibility.
the build seems red
@findinpath please rebase
Rebased to deal with conflicts from master.
Test PR with secrets: https://github.com/trinodb/trino/pull/15098 (@nineinchnick's https://github.com/trinodb/trino/pull/12817 will make these PRs easier)
Also, the first commit is separately being tested here: https://github.com/trinodb/trino/pull/15099
@findinpath can you please investigate the CI outcome?
https://github.com/trinodb/trino/actions/runs/3495735811/jobs/5852837911 showcases that the content can't be deleted on the presto-ci-test S3 bucket.
Error: io.trino.plugin.hive.s3.TestTrinoS3FileSystemAwsS3.testDeleteNonRecursivelyNonEmptyDeepPath Time elapsed: 0.403 s <<< FAILURE!
java.io.IOException: Failed to delete paths with the prefix path s3://***/test-illegal-delete-non-recursively-deep-path-non-empty-e69bb17a-8e62-49e2-b281-1a84037c5992
...
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: EHCHYPMFXNKTEW9C; S3 Extended Request ID: OfGRQM1L2Kv6gX2pQLjTiLWxdmEJCUkHa5nkpa5qrtyXSaqZVlpb9X6R8TgIYP+h8Tm8gW3fFf4=; Proxy: null), S3 Extended Request ID: OfGRQM1L2Kv6gX2pQLjTiLWxdmEJCUkHa5nkpa5qrtyXSaqZVlpb9X6R8TgIYP+h8Tm8gW3fFf4=
CI hit https://github.com/trinodb/trino/issues/13199 2 times.
Rebased on top of master in order to cope with the fact that the hive AWS tests now use different secrets, which is why I needed to change the S3 bucket used for testing to trino-ci-test correspondingly.
The only file changed after the rebase was ci.yml
@findepi / @ebyhr please update the clone PR.
Updated the mirror PR #15098
CI hit https://github.com/trinodb/trino/issues/13199 for ci / pt (default, suite-delta-lake-databricks113).
ci / check-commits (pull_request) Failing after 37s
Rebasing on top of master to address the check-commits issue.
/test-with-secrets sha=12265ea06d6ac9e006bb56973cb23cc54a8e9d2d
(rebased to fix test-with-secrets job)
/test-with-secrets sha=07e689f46a2a40fe6e618b02a0eae4a2ff492b76
Run with secrets: https://github.com/trinodb/trino/actions/runs/3591234020
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3591234020
@findinpath what were the changes in the last two force-pushes (https://github.com/trinodb/trino/pull/13974#event-7930164952, https://github.com/trinodb/trino/pull/13974#event-7932782205)?
I have changed the setup() and tearDown() methods of TestTrinoS3FileSystemMinio so that no resources are allocated in the constructor.
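For context, a sketch of that pattern (using a generic MinIO test container as a stand-in for the real test setup): the container is started in setup() rather than the constructor, so a failed allocation cannot leave the test class half-initialized, and tearDown() releases it even when tests fail.

import org.testcontainers.containers.GenericContainer;
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;

// Illustrative only; the real TestTrinoS3FileSystemMinio differs.
public class TestTrinoS3FileSystemMinio
{
    private GenericContainer<?> minio; // intentionally not allocated in the constructor

    @BeforeClass
    public void setup()
    {
        minio = new GenericContainer<>("minio/minio")
                .withCommand("server", "/data")
                .withExposedPorts(9000);
        minio.start();
    }

    @AfterClass(alwaysRun = true)
    public void tearDown()
    {
        if (minio != null) {
            minio.stop();
            minio = null;
        }
    }
}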
/test-with-secrets sha=ba2d0b67d0890174483733fe48012373c7478ead
The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/3623643550
/test-with-secrets sha=62a44129acd1d4db0e8630062e3c01901bac0882