amazon-s3-find-and-forget
Remove S3FS dependency
Description of changes:
This PR removes s3fs to simplify handling communications with S3.
It uses pyarrow's S3FileSystem to read, and boto3 to write back.
I am using pyarrow to read because it lets us get rid of s3fs, and I haven't noticed any difference in read performance.
I tried to use pyarrow to write back to S3 too, but unfortunately it doesn't offer an API that gives me back the VersionId of the write. If I instead read the latest version using boto3 after writing, I couldn't be confident that the version I'm reading is the one I just wrote, and therefore I wouldn't be able to perform consistency checks.
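To illustrate why the VersionId matters, here is a minimal sketch of a check that inspects the exact version that was written rather than whatever is latest. The function name `verify_written_version` and the size comparison are assumptions made for illustration, not the checks this PR implements; `client` is expected to be a boto3 S3 client, whose `head_object` call accepts a `VersionId` parameter.

```python
def verify_written_version(client, bucket, key, version_id, expected_size):
    """Return True if the specific version we wrote exists with the expected size.

    Hypothetical helper: the real consistency checks may compare other fields.
    """
    # Passing VersionId pins the request to the version returned by our write,
    # so a concurrent writer creating a newer version cannot confuse the check.
    resp = client.head_object(Bucket=bucket, Key=key, VersionId=version_id)
    return resp["ContentLength"] == expected_size
```

Without the VersionId, the same call would describe the latest version, which may belong to another writer.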
In order to remove s3fs, I investigated boto3 and found a (new?) convenient method called upload_fileobj that is smart enough to decide whether to use a multipart upload or not, and I noticed an increase in performance compared to s3fs. The issue is that this method doesn't return the VersionId either, but I managed to find a way to monkey patch the code while we wait for this issue to be resolved. I found some Pull Requests too, so perhaps we will see this landing in boto soon.
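For readers who want to see one way to recover the VersionId from `upload_fileobj`, here is a hedged sketch. It is not the monkey patch used in this PR: it swaps in botocore's client event system and listens to the `after-call` events of the two operations that can finish an upload (`PutObject` for single-part, `CompleteMultipartUpload` for multipart). The helper name `upload_and_get_version_id` is hypothetical. It assumes bucket versioning is enabled, that the transfer goes through the same client's regular API calls, and that no other upload runs concurrently on that client.

```python
def upload_and_get_version_id(client, fileobj, bucket, key, extra_args=None):
    """Upload with client.upload_fileobj and return the VersionId, if any.

    Hypothetical helper: `client` is assumed to be a boto3 S3 client.
    """
    captured = []

    def _capture(parsed, **kwargs):
        # `parsed` is the parsed API response for the finished operation.
        version_id = parsed.get("VersionId")
        if version_id is not None:
            captured.append(version_id)

    events = (
        "after-call.s3.PutObject",
        "after-call.s3.CompleteMultipartUpload",
    )
    for name in events:
        client.meta.events.register(name, _capture)
    try:
        # upload_fileobj blocks until the transfer has completed.
        client.upload_fileobj(fileobj, bucket, key, ExtraArgs=extra_args)
    finally:
        # Always detach the handlers so later calls are not affected.
        for name in events:
            client.meta.events.unregister(name, _capture)
    return captured[-1] if captured else None
```

A caller would pass the returned value to any follow-up request that must target the freshly written version. The function returns `None` when the response carries no VersionId, for example on an unversioned bucket.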
I ran the acceptance tests as well as some manual tests with very large files (8GB snappy-compressed parquet). I didn't see any difference in read performance, and writes were about 3x faster.
PR Checklist:
- [x] Changelog updated
- [x] Unit tests (and integration tests if applicable) provided
- [x] All tests pass
- [x] Pre-commit checks pass
- [x] Debugging code removed
- [x] If releasing a new version, have you bumped the version in the main CFN template?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.