amazon-s3-find-and-forget Remove S3FS dependency

Open matteofigus opened this issue 2 years ago • 0 comments

Description of changes:

This PR removes the s3fs dependency to simplify communication with S3.

It uses pyarrow's S3FileSystem to read, and boto3 to write back.
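As a rough illustration of the read path, here is a minimal sketch of opening a Parquet object through pyarrow's `S3FileSystem`. The helper names `open_s3_input` and `read_parquet_table` are hypothetical and are not taken from this PR's diff; the region argument is assumed to be known by the caller.

```python
def open_s3_input(fs, bucket, key):
    """Open bucket/key for random-access reads through a pyarrow filesystem."""
    # pyarrow S3 paths are written as "bucket/key", without an "s3://" scheme.
    return fs.open_input_file(f"{bucket}/{key}")


def read_parquet_table(bucket, key, region):
    """Hypothetical helper: read a whole Parquet object into a pyarrow Table."""
    # Imports are local so that open_s3_input can be exercised without pyarrow.
    import pyarrow.parquet as pq
    from pyarrow.fs import S3FileSystem

    fs = S3FileSystem(region=region)
    with open_s3_input(fs, bucket, key) as source:
        return pq.read_table(source)
```

Passing the filesystem in as an argument keeps the path handling separate from credential and region setup, which makes it easy to swap in a local or fake filesystem in tests.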

I read with pyarrow because it lets us drop s3fs, and I haven't noticed any difference in performance.

I tried to use pyarrow to write back to S3 too, but unfortunately it doesn't offer an API that returns the VersionId of the write. If I instead read the latest version using boto3, I could no longer be confident that the version I'm reading is the one I just wrote, and therefore I wouldn't be able to perform consistency checks.
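To make the point about the VersionId concrete, here is a minimal sketch of a write followed by a check pinned to the exact version that was written. It is an assumption for illustration only: it uses `put_object` (which does return `VersionId` on a versioned bucket) rather than the upload method used in this PR, and the size comparison stands in for whatever consistency checks the project actually performs. The function name `put_and_verify` is hypothetical.

```python
def put_and_verify(client, bucket, key, body):
    """Write bytes to S3, then check the exact version that was just written."""
    put = client.put_object(Bucket=bucket, Key=key, Body=body)
    # VersionId is only present when versioning is enabled on the bucket.
    version_id = put.get("VersionId")
    if version_id is None:
        raise RuntimeError("no VersionId returned; is bucket versioning enabled?")
    # Pin the read to that version instead of trusting "latest".
    head = client.head_object(Bucket=bucket, Key=key, VersionId=version_id)
    if head["ContentLength"] != len(body):
        raise RuntimeError("size mismatch for version " + version_id)
    return version_id
```

Without the VersionId, the `head_object` call would have to target the latest version, which another writer could have replaced in the meantime.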

To remove s3fs on the write path, I investigated boto3 and found a (new?) convenient method called upload_fileobj that is smart enough to decide whether or not to use a multipart upload, and I noticed an increase in performance compared to s3fs. The issue is that this method doesn't return the VersionId either, but I managed to find a way to monkey patch the code while we wait for this issue to be resolved. I found some pull requests too, so perhaps we will see this land in boto3 soon.
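For readers who want to see what recovering the VersionId around `upload_fileobj` can look like, here is a sketch of one possible approach. It is not the monkey patch used in this PR: it swaps in botocore's event system, listening for the `after-call` events of the two operations that finish an upload. The names `VersionIdCapture` and `upload_with_version` are hypothetical, and the sketch assumes the client is not used for other writes while the upload is running.

```python
import threading


class VersionIdCapture:
    """Records the VersionId found in S3 write responses (illustrative only)."""

    # A single-part upload ends with PutObject, a multipart one with
    # CompleteMultipartUpload.
    EVENTS = (
        "after-call.s3.PutObject",
        "after-call.s3.CompleteMultipartUpload",
    )

    def __init__(self):
        self._lock = threading.Lock()
        self.version_id = None

    def __call__(self, parsed, **kwargs):
        # The handler may run on a transfer worker thread, hence the lock.
        version_id = parsed.get("VersionId")
        if version_id is not None:
            with self._lock:
                self.version_id = version_id


def upload_with_version(client, fileobj, bucket, key):
    """Upload via upload_fileobj and return the captured VersionId, if any."""
    capture = VersionIdCapture()
    for event in capture.EVENTS:
        client.meta.events.register(event, capture)
    try:
        client.upload_fileobj(fileobj, bucket, key)
    finally:
        # Always detach the handlers so later calls are not affected.
        for event in capture.EVENTS:
            client.meta.events.unregister(event, capture)
    return capture.version_id
```

The function returns `None` when no VersionId was seen, for example on a bucket without versioning, so callers should treat that as "cannot verify" rather than as success.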

I ran the acceptance tests as well as some manual tests with very large files (8GB snappy-compressed Parquet). I didn't see any difference in reads, and I saw a performance improvement (3x) in writes.

PR Checklist:

[x] Changelog updated
[x] Unit tests (and integration tests if applicable) provided
[x] All tests pass
[x] Pre-commit checks pass
[x] Debugging code removed
[x] If releasing a new version, have you bumped the version in the main CFN template?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Jul 28 '22 15:07 matteofigus
