smart_open
smart_open copied to clipboard
Rewrite of s3.Reader class to protect S3 from open range headers (#725)
Motivation
The smart_open.s3.Reader
class left the HTTP range header open (e.g. bytes=0-
). That was done to allow multiple (future) calls to read
to stream bytes from the same HTTP body to avoid duplicating HTTP headers and connections. The open range header causes many S3 servers to internally move the full file, or a large block of the file, from its storage location to a server handling the S3 request. This was demonstrated on Ceph/S3 and SeaweedFS/S3, see issue #725. In the case of a small read against a large file this can quickly overwhelm the S3 filesystem.
Fixes #725.
Work Performed
The class smart_open.s3.Reader
was re-written to ensure the HTTP range headers are always set reasonably to protect the S3 filesystem. The rewrite strikes a balance between optimizing for the open-read-close use case and the open-repeat-read-close use case. See the git issue #725 for a detailed discussion.
Tests
Existing unit tests for smart_open.s3
all pass. A few unit tests were removed for features that were eliminated and a few unit tests were added to cover new features. Some unrelated unit tests were updated because they failed to clean up after themselves.
Checklist
Before you create the PR, please make sure you have:
- [X] Picked a concise, informative and complete title
- [X] Clearly explained the motivation behind the PR
- [X] Linked to any existing issues that your PR will be solving
- [X] Included tests for any new functionality
- [X] Checked that all unit tests pass
Workflow
Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress. Rebasing can make your commits look a bit cleaner, but it also makes life more difficult from the reviewer, because they are no longer able to distinguish between code that has already been reviewed, and unreviewed code.
@piskvorky There's a ton of work left for this PR, my recommendation is to remove it from the milestone and deal with it after we release 6.3.0
came across a nice read about S3 and range requests https://blog.limbus-medtec.com/the-aws-s3-denial-of-wallet-amplification-attack-bc5a97cc041d