                        Add Glacier skipping logic to S3FileSystem
Description
TrinoS3FileSystem contains logic/configuration for skipping Glacier objects found in S3.
This PR ports that missing functionality/configuration from TrinoS3FileSystem to S3FileSystem, and also adds the capability to configure the filesystem to read restored Glacier objects.
Additional context and related issues
Recently, S3 added the RestoreStatus to the result of ListObjectsV2 - see https://aws.amazon.com/about-aws/whats-new/2023/06/amazon-s3-restore-status-s3-glacier-objects-list-api/ for more details.
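The combination of the skip-Glacier setting and the new RestoreStatus listing field suggests a simple decision rule: non-Glacier objects are always readable, restored Glacier objects are readable when the filesystem is configured to read them, and everything else is skipped when skipping is enabled. A minimal sketch of that rule, using hypothetical names (`GlacierFilter`, `shouldSkip`) rather than the PR's actual API:

```java
import java.util.Optional;

// Hypothetical sketch of the skip decision; class and method names are
// illustrative, not the PR's actual implementation.
public class GlacierFilter
{
    private final boolean skipGlacierObjects;
    private final boolean readRestoredGlacierObjects;

    public GlacierFilter(boolean skipGlacierObjects, boolean readRestoredGlacierObjects)
    {
        this.skipGlacierObjects = skipGlacierObjects;
        this.readRestoredGlacierObjects = readRestoredGlacierObjects;
    }

    // storageClass comes from the ListObjectsV2 entry; restoreComplete models
    // the new RestoreStatus field (present and true once a restore finished)
    public boolean shouldSkip(String storageClass, Optional<Boolean> restoreComplete)
    {
        boolean isGlacier = "GLACIER".equals(storageClass) || "DEEP_ARCHIVE".equals(storageClass);
        if (!isGlacier) {
            return false; // normal objects are always readable
        }
        if (readRestoredGlacierObjects && restoreComplete.orElse(false)) {
            return false; // restored Glacier objects are readable on request
        }
        return skipGlacierObjects;
    }
}
```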
Release notes
( ) This is not user-visible or docs only and no release notes are required.
(x) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:
# Section
* Fix some things. ({issue}`issuenumber`)
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla
Facing some dependency issues: it looks like the Netty version required by the SDK release that includes restoreStatus breaks Apache Arrow:
<!-- Netty 4.1.94.Final breaks Apache Arrow -->
<dep.netty.version>4.1.93.Final</dep.netty.version>
Per this comment, we need to wait for the next release of Apache Arrow.
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua
:wave: @ericlgoodman @findepi @electrum - this PR has become inactive. We hope you are still interested in working on it. Please let us know, and we can try to get reviewers to help with that.
We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks.
Apologies for the delay on following up on this. As @findepi mentioned, putting the skipping logic directly in the file system is problematic for deletes, and we really only want this to apply to Hive file listing. Adding a new listing method just for this is ugly and breaks the abstractions.
Looking at this with fresh eyes, I'm thinking we can implement this by adding a Set<String> tags field to the FileEntry object. Objects could be tagged with s3:glacier and s3:glacierRestored as applicable. Then the Hive connector can have a hive.s3.skip-glacier-objects config in HiveConfig and handle the skipping inside TrinoFileStatusRemoteIterator.
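The proposal above can be sketched in a few lines: a listing entry carries a set of tags, and a small helper derives the Glacier tags from the object's storage class and restore state. The record shape and helper name here are assumptions for illustration; only the tag names (s3:glacier, s3:glacierRestored) come from the comment.

```java
import java.util.Set;

// Illustrative sketch of the proposed design; FileEntry here is a stand-in
// for Trino's real FileEntry, and S3Tags is a hypothetical helper.
record FileEntry(String location, long length, Set<String> tags) {}

class S3Tags
{
    // Derive Glacier-related tags from the ListObjectsV2 entry
    static Set<String> tagsFor(String storageClass, boolean restored)
    {
        boolean glacier = "GLACIER".equals(storageClass) || "DEEP_ARCHIVE".equals(storageClass);
        if (!glacier) {
            return Set.of();
        }
        return restored
                ? Set.of("s3:glacier", "s3:glacierRestored")
                : Set.of("s3:glacier");
    }
}
```

With tags attached at listing time, a hive.s3.skip-glacier-objects setting in HiveConfig would only need to drop entries whose tags contain s3:glacier (and not s3:glacierRestored) during iteration.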
Sounds good to me. Sorry for the delay on this - we've had a lot of reshuffling lately. Planning on having someone on my team pick this up in the next few weeks.
I'm thinking we can implement this by adding a Set<String> tags field to the FileEntry object. Objects could be tagged with s3:glacier and s3:glacierRestored as applicable. Then the Hive connector can have a hive.s3.skip-glacier-objects config in HiveConfig and handle the skipping inside TrinoFileStatusRemoteIterator.
Hi @electrum, I'll be picking this back up. I was wondering how we should implement the skipping logic in TrinoFileStatusRemoteIterator. If we pass the filter in there, how should the hasNext() and next() logic work? hasNext() only tells us whether there is a next object, while in next() we could check whether the object matches our filter, but we couldn't simply call next() again because we don't know whether hasNext() still holds. Any help would be appreciated, thank you.
Hi @bangtim, take a look at io.trino.plugin.hive.fs.DirectoryListingFilter or com.google.common.collect.AbstractIterator for ideas on how to implement the skipping logic. The basic idea is that the iterator needs to do the filtering and hold onto the next element, which can then be referenced in the hasNext() or next() methods.
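The look-ahead pattern described above can be sketched as follows, in the spirit of Guava's AbstractIterator and Trino's DirectoryListingFilter (this is a standalone illustration, not either of those classes): the iterator eagerly advances the delegate past non-matching elements, buffers the next match, and serves it from next().

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Look-ahead filtering iterator: filtering happens while advancing,
// so hasNext() is answered from the buffered element.
class FilteringIterator<T> implements Iterator<T>
{
    private final Iterator<T> delegate;
    private final Predicate<T> filter;
    private T next; // buffered matching element, null when exhausted

    FilteringIterator(Iterator<T> delegate, Predicate<T> filter)
    {
        this.delegate = delegate;
        this.filter = filter;
        advance(); // prime the buffer
    }

    // Walk the delegate until a matching element is found or it is exhausted
    private void advance()
    {
        next = null;
        while (delegate.hasNext()) {
            T candidate = delegate.next();
            if (filter.test(candidate)) {
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext()
    {
        return next != null;
    }

    @Override
    public T next()
    {
        if (next == null) {
            throw new NoSuchElementException();
        }
        T result = next;
        advance(); // refill the buffer for the following call
        return result;
    }
}
```

This resolves the hasNext()/next() concern from the question above: the filter only ever runs inside advance(), and both public methods just consult the buffer.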
To be completed via https://github.com/trinodb/trino/pull/21164