
fix: fixing tests to work with S3 Express

Open stubz151 opened this issue 1 year ago • 4 comments

Why am I doing this

The following cases exist that will cause the integration tests to fail when pointed at an s3 express bucket:

  • S3 Express requires a delimiter ("/") at the end of the list prefix, so the tests need updating to cater for that

  • File ACLs aren’t supported by S3 Express

  • S3 Express doesn't have versioning support

  • KMS/custom encryption isn’t supported by S3 Express.

  • Multi-part uploads fail with a ‘This bucket does not support Content Md5 header.’ error.
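The delimiter requirement in the first bullet can be handled by normalizing prefixes before issuing list requests. A minimal sketch in Java (the helper name `ensureTrailingDelimiter` is hypothetical, not Iceberg's API):

```java
// Hypothetical helper: S3 Express (directory) buckets expect list
// prefixes to end with the "/" delimiter, while general purpose
// buckets accept either form. Appending the delimiter when it is
// missing keeps one code path working against both bucket types.
public class PrefixUtil {
  public static String ensureTrailingDelimiter(String prefix) {
    if (prefix == null || prefix.isEmpty()) {
      return prefix;
    }
    return prefix.endsWith("/") ? prefix : prefix + "/";
  }
}
```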

What have I done

  • updated the prefix to end with a "/"; this caters for both S3 general purpose and S3 Express buckets

  • disabled the S3 Express-failing tests that exercise features it does not currently support

  • used the suffix "x-s3" to detect express buckets

  • added a method to clean up S3 Express buckets
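The suffix check above could look something like the following sketch (the class and method names are hypothetical; S3 Express directory bucket names end with the "--x-s3" suffix, e.g. my-bucket--usw2-az1--x-s3, which is why a name-suffix test is sufficient):

```java
// Hypothetical detection helper: a bucket name ending in "x-s3"
// indicates an S3 Express directory bucket, whose names follow the
// pattern <base-name>--<zone-id>--x-s3. General purpose bucket
// names cannot contain "--", so false positives are unlikely.
public class S3ExpressDetector {
  public static boolean isDirectoryBucket(String bucketName) {
    return bucketName != null && bucketName.endsWith("x-s3");
  }
}
```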

How have I tested it

  • Ran s3 integration tests targeting general s3 bucket on my account

  • Ran s3 integration tests targeting s3 express bucket on my account

stubz151 avatar Aug 27 '24 16:08 stubz151

One comment: I also had some concerns about the previous attempt at fixing the tests.

https://github.com/apache/iceberg/pull/10292/files what's the take in there?

cc @steveloughran for review

-> With regard to HadoopFileIO, does that refer to this? https://iceberg.apache.org/docs/1.5.0/aws/#hadoop-s3a-filesystem

I can add the trailing slash to the ListPrefix request. Do we want to do it for all S3 buckets (general purpose + directory)? That would be a breaking change. Or would we prefer to add it only in the tests, as I have done?

stubz151 avatar Aug 28 '24 14:08 stubz151

I can add the trailing slash to the ListPrefix request. Do we want to do it for all S3 buckets (general purpose + directory)? That would be a breaking change. Or would we prefer to add it only in the tests, as I have done?

If you don't do it in production then things which omit to pass it down will break. And I'm confident that nothing is passing down a path like

listObjects("path/p1") expecting it to find path/p1/* and other matches like path/p1Filename; otherwise HadoopFileIO list prefix would be failing and people would have noticed.
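The listing semantics being discussed can be illustrated with plain string prefix matching, which is how S3 filters keys when no delimiter is applied (the key names below are purely illustrative):

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: S3 list-objects filters keys by raw string
// prefix. Without a trailing "/", the prefix "path/p1" matches both
// keys under the "directory" path/p1/ and sibling keys such as
// path/p1Filename; with the trailing "/", only the former match.
public class PrefixMatchDemo {
  public static List<String> listByPrefix(List<String> keys, String prefix) {
    return keys.stream()
        .filter(k -> k.startsWith(prefix))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> keys =
        List.of("path/p1/a.parquet", "path/p1/b.parquet", "path/p1Filename");
    // "path/p1" matches all three keys; "path/p1/" matches only two.
    System.out.println(listByPrefix(keys, "path/p1"));
    System.out.println(listByPrefix(keys, "path/p1/"));
  }
}
```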

steveloughran avatar Aug 30 '24 11:08 steveloughran

I can add the trailing slash to the ListPrefix request. Do we want to do it for all S3 buckets (general purpose + directory)? That would be a breaking change. Or would we prefer to add it only in the tests, as I have done?

If you don't do it in production then things which omit to pass it down will break. And I'm confident that nothing is passing down a path like

listObjects("path/p1") expecting it to find path/p1/* and other matches like path/p1Filename; otherwise HadoopFileIO list prefix would be failing and people would have noticed.

I've added it for express; let me know what you think :)

stubz151 avatar Sep 02 '24 13:09 stubz151

Just reset the history; this PR got a bit messy, sorry.

stubz151 avatar Sep 02 '24 13:09 stubz151

This mostly looks good to me now, just a few small nit comments. And I think we should update aws.md about using directory buckets, but that can also be a separate PR, up to you.

jackye1995 avatar Oct 23 '24 16:10 jackye1995

Sorry, I made a small Javadoc style edit and it caused some style failures; I pushed the fix myself.

jackye1995 avatar Oct 23 '24 18:10 jackye1995

AccessPoints not being supported

Yes, today users also configure features like cross-bucket replication through the access point configuration. The access point naming no longer fits its actual use case. Maybe we should rename it in 2.0.

jackye1995 avatar Oct 24 '24 03:10 jackye1995

Sometime I should update the part of the Iceberg docs that deprecates the S3A connector to highlight what it offers, adding large-file upload to S3 Express stores to the list, along with "hardened error handling, metrics, audit logs, advanced authentication mechanisms and more".

steveloughran avatar Oct 28 '24 13:10 steveloughran