Add support for streaming (transfer-encoding: chunked) v4 uploads
While this issue will probably be resolved in botocore, I wanted to file it here as it seems issues filed against botocore don't get much attention.
The thrust of it is that AWS S3 supports streaming uploads with v4 signer (http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html), however, boto computes the entire content hash upfront. There is no way to opt into the streaming v4 upload.
For an upload that originates from another stream, this issue means buffering on disk or in memory to compute the content hash before uploading to S3. It'd be great to expose streaming v4 uploads to the consumers.
The issue against botocore is https://github.com/boto/botocore/issues/995
I wonder how much of an issue this is in practice for boto3 users given the features in yesterday's release (1.4.0). Using the high-level upload_fileobj family of methods, if your file is over the multipart threshold (default of 8 MB), we'll automatically switch to multipart uploads. Parts can also be uploaded in parallel.
This means we don't actually need to read the entire file first. For streams, we'll buffer up to the multipart threshold and then send the file in 8MB parts. The most we need to read for any given upload part is then 8MB (vs. the whole file).
An additional benefit of upload_fileobj() is that you also don't need to know the file length in advance. Because we're using multipart uploads, you can give it any non-seekable stream and it will just work.
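For reference, here's a minimal sketch of that usage (bucket, key, and file names are placeholders; the TransferConfig values shown are the defaults):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # switch to multipart above 8 MB
    multipart_chunksize=8 * 1024 * 1024,  # size of each uploaded part
    max_concurrency=10,                   # number of parts uploaded in parallel
)

# The file object only needs to be readable; it does not have to be seekable
# and its length does not have to be known in advance.
with open("large-file.bin", "rb") as fileobj:
    s3.upload_fileobj(fileobj, "my-bucket", "my-key", Config=config)
```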
That being said, I'm not opposed to supporting streaming sigv4. I just think that if we are going to support it, we should have a good use case driving the feature, and with the new upload_fileobj, I'm not sure there are any benefits.
What are your thoughts?
@jamesls That sounds like a useful improvement to the boto3 API. I see a couple of points where a chunked upload may still be preferred.
In my particular use case, data is copied from one object store provider to another, so it comes from an HTTP stream. I'd prefer not to buffer significant amounts of data in memory or on disk while this is happening. Using multipart could work here, but I'd want to limit the number of concurrent parts: there may be multiple objects being moved this way, so I'd probably end up setting max concurrency to a relatively small value.
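Roughly, here's what that workaround looks like (a sketch only; I'm using requests for illustration, and the bucket, key, and URL are placeholders):

```python
import boto3
import requests
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Keep concurrency low because several objects may be in flight at once;
# memory use is roughly max_concurrency * multipart_chunksize per transfer.
config = TransferConfig(multipart_chunksize=8 * 1024 * 1024, max_concurrency=2)

with requests.get("https://source.example/object", stream=True) as resp:
    resp.raise_for_status()
    # resp.raw is a non-seekable file-like object; the transfer manager buffers
    # one chunk at a time per worker and sends it as a part.
    s3.upload_fileobj(resp.raw, "dest-bucket", "dest-key", Config=config)
```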
If the incoming stream is interrupted, how does boto handle partial multipart uploads? I was looking through the s3transfer upload code and didn't see handling of partial uploads (I could've missed it!). I suspect the suggested solution would be to set a bucket policy to expire multipart uploads after a certain amount of time.
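If that's the answer, the mitigation would presumably look something like this (bucket name and expiry are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Abort incomplete multipart uploads after a few days so parts left behind by
# interrupted streams don't accumulate (and keep incurring storage costs).
s3.put_bucket_lifecycle_configuration(
    Bucket="dest-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-multipart",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```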
Specific to my use case, using multipart uploads introduces an additional problem: the ETag is no longer the MD5 of the content. By default it will no longer be possible for me to validate that the object I got from one store is the same one in S3. That would probably require propagating the source ETag as an extra metadata field. This is particularly painful because LIST will no longer return a useful ETag for checking whether an object has been copied or not.
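The metadata workaround would look roughly like this (a sketch; all names and the ETag value are placeholders):

```python
import boto3

s3 = boto3.client("s3")
source_etag = "d41d8cd98f00b204e9800998ecf8427e"  # ETag reported by the source store

with open("object.bin", "rb") as fileobj:
    s3.upload_fileobj(
        fileobj,
        "dest-bucket",
        "dest-key",
        ExtraArgs={"Metadata": {"source-etag": source_etag}},
    )

# Validation now requires a HEAD per object instead of reading ETags from LIST.
head = s3.head_object(Bucket="dest-bucket", Key="dest-key")
assert head["Metadata"]["source-etag"] == source_etag
```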
While multipart works with caveats, I don't think it's the right abstraction: it makes uploading an HTTP stream considerably more complicated than it would otherwise be. Chunked uploads seem to me like a natural fit given these issues.
For what it's worth, I submitted a PR to add the option to opt into chunked uploads and am happy to iterate on that if it's something that would make sense within botocore.
@jamesls, I am currently implementing a system where a distributed streaming transfer of multiple parts in a multipart upload happens on resource-limited hosts (Lambdas), and must work even with very large files. I'm not 100% sure what design I'll settle on yet, but the crux of the issue is that I have to download a part from a remote host and then upload it to S3, and neither memory nor disk is a good place to hold the part, which for large files may be up to 500 MB (5 TB / 10,000 parts). I'd like to use truly streaming chunked transfers.
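To make the constraint concrete, here's roughly what each worker has to do today (a sketch only; I'm using requests for illustration, and in the real design the upload_part calls would be spread across separate Lambda invocations):

```python
import boto3
import requests

s3 = boto3.client("s3")

mpu = s3.create_multipart_upload(Bucket="dest-bucket", Key="dest-key")

def transfer_part(part_number, source_url):
    # The entire part has to be buffered (here in memory) before upload_part
    # can sign and send it; for very large objects that can approach 500 MB.
    body = requests.get(source_url).content
    resp = s3.upload_part(
        Bucket="dest-bucket",
        Key="dest-key",
        UploadId=mpu["UploadId"],
        PartNumber=part_number,
        Body=body,
    )
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

parts = [transfer_part(1, "https://source.example/object?part=1")]

s3.complete_multipart_upload(
    Bucket="dest-bucket",
    Key="dest-key",
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)
```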
I am currently writing a Lex client, and streaming support would help with performance: audio from the microphone could be streamed live to the Lex service rather than being buffered locally first. This is explicitly mentioned in the Lex documentation: http://docs.aws.amazon.com/lex/latest/dg/API_runtime_PostContent.html#lex-runtime_PostContent-request-inputStream I have tried this with the Node.js SDK and can confirm that streaming yields a significant performance improvement.
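For context, the call in question looks like this (bot names and the audio source are placeholders; today boto3 reads the whole inputStream up front to sign the request, so the audio can't be streamed live from the microphone):

```python
import boto3

lex = boto3.client("lex-runtime")

with open("utterance.pcm", "rb") as audio:  # stand-in for a live microphone stream
    response = lex.post_content(
        botName="OrderFlowers",
        botAlias="prod",
        userId="example-user",
        contentType="audio/l16; rate=16000; channels=1",
        accept="text/plain; charset=utf-8",
        inputStream=audio,
    )

print(response.get("message"))
```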
Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.
Streaming sigv4 uploads are still a missing feature in boto3. They are needed for a variety of streaming applications where the size is not known in advance and where one would like to minimize latency/bufferbloat and avoid holding the entire file in memory or on disk.
Currently the upload calculates the hash, sends it in the headers, and then reads the data a second time to upload it.
As far as I know there is no buffering involved, and because of these two reads it's not possible to upload without a seekable file.
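A minimal illustration of what that means for non-seekable input (assuming put_object; bucket and key are placeholders):

```python
import io
import os
import boto3

s3 = boto3.client("s3")

read_fd, write_fd = os.pipe()
os.write(write_fd, b"streamed data")
os.close(write_fd)

stream = os.fdopen(read_fd, "rb")  # non-seekable: can't be read twice

# Workaround today: drain the whole stream into memory (or onto disk) so it
# can be hashed for the signature and then read again for the upload.
buffered = io.BytesIO(stream.read())
s3.put_object(Bucket="my-bucket", Key="my-key", Body=buffered)
```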
The Go SDK alleviates this issue by allowing the checksum to be disabled altogether: https://aws.github.io/aws-sdk-go-v2/docs/sdk-utilities/s3/#unseekable-streaming-input In that case, under certain conditions, we can still compute the checksum during the transfer and verify it after the full transfer using the ETag, but does that mean the benefit of multipart upload is lost?