trino icon indicating copy to clipboard operation
trino copied to clipboard

Add Glacier skipping logic to S3FileSystem

Open ericlgoodman opened this issue 2 years ago • 7 comments

Description

TrinoS3FileSystem contains logic/configuration for skipping Glacier objects found in S3.

This PR adds the missing functionality/configuration from TrinoS3FileSystem and also adds the capability to configure the filesystem to read restored Glacier objects.

Additional context and related issues

Recently, S3 added the RestoreStatus to the result of ListObjectsV2 - see https://aws.amazon.com/about-aws/whats-new/2023/06/amazon-s3-restore-status-s3-glacier-objects-list-api/ for more details.

Release notes

( ) This is not user-visible or docs only and no release notes are required. (x) Release notes are required, please propose a release note for me. () Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

ericlgoodman avatar Jul 12 '23 01:07 ericlgoodman

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

cla-bot[bot] avatar Jul 12 '23 01:07 cla-bot[bot]

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

cla-bot[bot] avatar Jul 13 '23 15:07 cla-bot[bot]

Facing some dependency issues, as it looks like the version of Netty used in the SDK with restoreStatus breaks Apache Arrow:

<!-- Netty 4.1.94.Final breaks Apache Arrow -->
<dep.netty.version>4.1.93.Final</dep.netty.version>

Per this comment, we need to wait for the next release of Apache Arrow.

ericlgoodman avatar Jul 20 '23 20:07 ericlgoodman

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions[bot] avatar Jan 12 '24 17:01 github-actions[bot]

:wave: @ericlgoodman @findepi @electrum - this PR has become inactive. We hope you are still interested in working on it. Please let us know, and we can try to get reviewers to help with that.

We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.

mosabua avatar Jan 12 '24 18:01 mosabua

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions[bot] avatar Feb 05 '24 17:02 github-actions[bot]

Apologies for the delay on following up on this. As @findepi mentioned, putting the skipping logic directly in the file system is problematic for deletes, and we really only want this to apply to Hive file listing. Adding a new listing method just for this is ugly and breaks the abstractions.

Looking at this with fresh eyes, I'm thinking we can implement this by adding a Set<String> tags field to the FileEntry object. Objects could be tagged with s3:glacier and s3:glacierRestored as applicable. Then the Hive connector can have a hive.s3.skip-glacier-objects config in HiveConfig and handle the skipping inside TrinoFileStatusRemoteIterator.

electrum avatar Feb 13 '24 01:02 electrum

Apologies for the delay on following up on this. As @findepi mentioned, putting the skipping logic directly in the file system is problematic for deletes, and we really only want this to apply to Hive file listing. Adding a new listing method just for this is ugly and breaks the abstractions.

Looking at this with fresh eyes, I'm thinking we can implement this by adding a Set<String> tags field to the FileEntry object. Objects could be tagged with s3:glacier and s3:glacierRestored as applicable. Then the Hive connector can have a hive.s3.skip-glacier-objects config in HiveConfig and handle the skipping inside TrinoFileStatusRemoteIterator.

Sounds good to me. Sorry for the delay on this - we've had a lot of reshuffling lately. Planning on having someone on my team pick this up in the next few weeks.

ericlgoodman avatar Feb 27 '24 21:02 ericlgoodman

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

github-actions[bot] avatar Mar 20 '24 17:03 github-actions[bot]

I'm thinking we can implement this by adding a Set<String> tags field to the FileEntry object. Objects could be tagged with s3:glacier and s3:glacierRestored as applicable. Then the Hive connector can have a hive.s3.skip-glacier-objects config in HiveConfig and handle the skipping inside TrinoFileStatusRemoteIterator.

Hi @electrum, I'll be picking this back up. I was wondering how we should go about implementing the skipping logic in TrinoFileStatusRemoteIterator. If we were to pass the filter to there, how would we go about the hasNext() and next() logic? hasNext() would only allow us to see if there is a next object, whereas in next() we could see if the object matches our filter but we wouldn't be able to just call next() again b/c we aren't sure if it hasNext(). Any help would be appreciated, thank you

bangtim avatar Mar 21 '24 15:03 bangtim

Hi @bangtim, take a look at io.trino.plugin.hive.fs.DirectoryListingFilter or com.google.common.collect.AbstractIterator for ideas on how to implement the skipping logic. The basic idea is that the iterator needs to do the filtering and hold onto the next element, which can then be referenced in the hasNext() or next() methods.

electrum avatar Mar 22 '24 20:03 electrum

To be completed via https://github.com/trinodb/trino/pull/21164

ericlgoodman avatar Apr 02 '24 00:04 ericlgoodman