smart_open icon indicating copy to clipboard operation
smart_open copied to clipboard

Rewrite of s3.Reader class to protect S3 from open range headers (#725)

Open davidparks21 opened this issue 2 years ago • 3 comments

Motivation

The smart_open.s3.Reader class left the HTTP range header open (e.g. bytes=0-). That was done to allow multiple (future) calls to read to stream bytes from the same HTTP body to avoid duplicating HTTP headers and connections. The open range header causes many S3 servers to internally move the full file, or a large block of the file, from its storage location to a server handling the S3 request. This was demonstrated on Ceph/S3 and SeaweedFS/S3, see issue #725. In the case of a small read against a large file this can quickly overwhelm the S3 filesystem.

Fixes #725.

Work Performed

The class smart_open.s3.Reader was re-written to ensure the HTTP range headers are always set reasonably to protect the S3 filesystem. The rewrite strikes a balance between optimizing for the open-read-close use case and the open-repeat-read-close use case. See the git issue #725 for a detailed discussion.

Tests

Existing unit tests for smart_open.s3 all pass. A few unit tests were removed for features that were eliminated and a few unit tests were added to cover new features. Some unrelated unit tests were updated because they failed to clean up after themselves.

Checklist

Before you create the PR, please make sure you have:

  • [X] Picked a concise, informative and complete title
  • [X] Clearly explained the motivation behind the PR
  • [X] Linked to any existing issues that your PR will be solving
  • [X] Included tests for any new functionality
  • [X] Checked that all unit tests pass

Workflow

Please avoid rebasing and force-pushing to the branch of the PR once a review is in progress. Rebasing can make your commits look a bit cleaner, but it also makes life more difficult from the reviewer, because they are no longer able to distinguish between code that has already been reviewed, and unreviewed code.

davidparks21 avatar Oct 19 '22 20:10 davidparks21

@piskvorky There's a ton of work left for this PR, my recommendation is to remove it from the milestone and deal with it after we release 6.3.0

mpenkov avatar Nov 29 '22 07:11 mpenkov

came across a nice read about S3 and range requests https://blog.limbus-medtec.com/the-aws-s3-denial-of-wallet-amplification-attack-bc5a97cc041d

ddelange avatar May 01 '24 19:05 ddelange