aws-lite
S3: Pipe Readable to PutObject with File param
- [x] Forked the repo and created your branch from `master`
- [x] Made sure tests pass (run `npm it` from the repo root)
- [ ] Expanded test coverage related to your changes:
  - [ ] Added and/or updated unit tests (if appropriate)
  - [ ] Added and/or updated integration tests (if appropriate)
- [ ] Updated relevant documentation:
  - [ ] Internal to this repo (e.g. `readme.md`, help docs, inline docs & comments, etc.)
  - [ ] Architect docs (arc.codes)
- [ ] Summarized your changes in `changelog.md`
- [x] Linked to any related issues, PRs, etc. below that may relate to, consume, or necessitate these changes
Stream files to PutObject
(See #65)
Basically:
- Don't chunk uploads
- Don't sign payloads
- When there's a `File` arg, use `fs.createReadStream(File)` as the payload
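A rough sketch of the idea (not the actual aws-lite internals; the helper and its shape are made up for illustration):

```js
// Hypothetical helper, just to illustrate the approach: if a File param is
// present, stream it from disk and skip payload signing
import { createReadStream, promises as fs } from 'node:fs'

async function buildPayload ({ File, Body }) {
  if (File) {
    const { size } = await fs.stat(File)
    return {
      body: createReadStream(File),                 // Readable piped straight to the socket
      headers: {
        'content-length': size,                     // known up front via stat
        'x-amz-content-sha256': 'UNSIGNED-PAYLOAD', // don't sign the payload
      },
    }
  }
  return { body: Body }                             // string / Buffer path unchanged
}
```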
The AWS SDK doesn't sign payloads by default (`s3DisableBodySigning` in v2 defaults to `true`; `applyChecksum` is the v3 option). AFAICT, the only time you need the payload signed is when it's required by bucket policy.
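For reference, here's roughly where that knob lives in v2 (as far as I can tell, v3's `applyChecksum` is an option on the `SignatureV4` signer rather than the client):

```js
import AWS from 'aws-sdk'

// v2: body signing is already off by default for SigV4 over HTTPS
const s3 = new AWS.S3({
  s3DisableBodySigning: true, // the default; set to false to force signed payloads
})
```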
I'm sure there's plenty to discuss here, but I was kinda mystified when I started looking at the wire traffic when using the AWS SDK...there wasn't anything fancy going on at all. There are code paths for signing chunked payloads...but they weren't running. Basically, all the tools needed to make this work were already present.
This first draft is a breaking change:
- If your bucket requires signed payloads, this won't work
- The `MinChunkSize` param is removed because of irrelevance (only TS will care about this, really)
I tested this with something like this:
`systemd-run --scope -p MemoryMax=50M -p MemorySwapMax=50M --user node ./awesome-script`
...that uploaded a 5GB file to a private bucket successfully. So, this works for me AFAICT; I've patched it into one of my projects and will be running it in the real world.
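For context, the caller side of that test looks roughly like this (bucket, key, and path are made up, and this assumes the `@aws-lite/s3` plugin loaded the usual way):

```js
import awsLite from '@aws-lite/client'

const aws = await awsLite({
  region: 'us-west-2',
  plugins: [ import('@aws-lite/s3') ],
})

// The 5GB payload is streamed from disk, so it never has to fit in memory
await aws.S3.PutObject({
  Bucket: 'my-private-bucket',
  Key: 'big-file.bin',
  File: '/tmp/big-file.bin',
})
```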
Now, some thoughts on chunking uploads:
- There's never any value AFAICT in chunking `string` and `Buffer`. They're already in memory, so the previous code was just doubling (at least) the memory footprint by precomputing chunks for no real reason.
- It's allowed for the last chunk to be smaller than the minimum chunk size, even if it's the only chunk (per docs), so chunking would be viable for `string` and `Buffer` payloads, but I dunno if it'd be worth it.
- The only reason I would use chunking is pretty much for the file case, and for pretty much the same reasons I'm making this change: to avoid loading the whole file into memory.
- It's possible to calculate chunks and signatures (which can have variable sizes) without precomputing all the chunks. To do this, you'd need to use `Transfer-Encoding: chunked` (this is what the SDK does too).
- I was on my way down that road (all of the above) before I realized I didn't need it. I think it's pretty reasonable that you could write a `Transform` that generates the chunks (see the sketch after this list). Node streams are kinda weird, so I'm still not sure whether backpressure is handled magically or not, but that would be important for my use case.
- This is only a theory/guess/hope, but you can put HTTP headers in the trailer chunk. I don't know that it's impossible to populate `x-amz-decoded-content-length` there for a stream whose length you don't know ahead of time. Really, to test this out I'd plug other kinds of `Readable` into the SDK and see how it handles things.
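Here's a sketch of the `Transform` idea mentioned above. The chunk-signature math (the HMAC chain seeded by the request's signature) is hand-waved behind a `signChunk` callback, since that's the part I haven't wired up:

```js
import { Transform } from 'node:stream'

const CRLF = '\r\n'

// Frames incoming bytes as aws-chunked: <hex length>;chunk-signature=<sig>\r\n<bytes>\r\n
// signChunk(previousSignature, chunkBytes) is assumed to return the next chunk signature
export function awsChunkedTransform ({ chunkSize = 64 * 1024, seedSignature, signChunk }) {
  let previousSignature = seedSignature
  let buffered = Buffer.alloc(0)

  function frame (chunk) {
    previousSignature = signChunk(previousSignature, chunk)
    return Buffer.concat([
      Buffer.from(`${chunk.length.toString(16)};chunk-signature=${previousSignature}${CRLF}`),
      chunk,
      Buffer.from(CRLF),
    ])
  }

  return new Transform({
    transform (data, encoding, callback) {
      buffered = Buffer.concat([ buffered, data ])
      while (buffered.length >= chunkSize) {
        this.push(frame(buffered.subarray(0, chunkSize)))
        buffered = buffered.subarray(chunkSize)
      }
      // Only producing data here and in flush(), and calling the callback when
      // done, is what lets Node manage backpressure: a slow socket eventually
      // pauses the source read stream
      callback()
    },
    flush (callback) {
      if (buffered.length) this.push(frame(buffered)) // final short chunk is allowed
      this.push(frame(Buffer.alloc(0)))               // zero-length terminal chunk
      callback()
    },
  })
}
```

Usage would be something like `createReadStream(File).pipe(awsChunkedTransform({ ... })).pipe(req)` against an HTTP request object, with `x-amz-decoded-content-length` set to the file size.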
Also, to be clear: this is just a POC and a conversation starter. I hope I didn't sound jerky or unreasonable above.
My immediate goal is met with this, I think, but there's a reasonable possibility to do all the things. I'd prefer to see the `File` param go away and instead loosen `Body` to accept a `Readable` as well, similar to the SDK interface. I do think payload signing should be optional, but also supported while streaming (at least from files).
I also kinda wanna implement the stuff I was talking about now that I've dug into it a bit.
@keithlayne thanks for this! None of this sounds jerky or unreasonable at all.
Some initial thoughts: I think using unsigned payloads makes sense as a way to shortcut us to streaming directly from the filesystem (for the reasons you already enumerated), and could be utilized as an alternative to signed payloads. However, I don't see the value of making breaking changes or removing existing functionality.
I'm thinking it's a solid idea to change the default to unsigned payloads, but to also leave signed payloads as an opt-in for those who need them (as per the AWS SDK options you mentioned above).
A couple other bits:
> There's never any value AFAICT in chunking `string` and `Buffer`. They're already in memory, so the previous code was just doubling (at least) the memory footprint by precomputing chunks for no real reason.
Well, I wouldn't say for "no real reason"; there were a variety of reasons I implemented things as I did. But it's definitely fair to argue that we should endeavor not to use memory unnecessarily, and as I mentioned above, it seems like your proposed unsigned approach is the better default.
> It's possible to calculate chunks and signatures (which can have variable sizes) without precomputing all the chunks. To do this, you'd need to use `Transfer-Encoding: chunked` (this is what the SDK does too).
Not sure I follow this: are you referring to `aws-chunked` transfer encoding? (Do you have a docs link on `chunked`? I can't find anything on that.)
> I think it's pretty reasonable that you could write a `Transform` that generates the chunks. Node streams are kinda weird, so I'm still not sure whether backpressure is handled magically or not, but that would be important for my use case.
Yeah, I looked into that when I wrote the first cut of this, and decided to just go with a simpler implementation with the intention to improve / optimize later. So, here we are! :)
Hey @ryanblock thanks for your input!
I was referring to the docs here: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html and the note at the top.
`Transfer-Encoding: chunked` is a separate thing: https://en.wikipedia.org/wiki/Chunked_transfer_encoding
Combining `Content-Encoding: aws-chunked` with `Transfer-Encoding: chunked` allows you to omit the `Content-Length` header (you don't need to know the full payload+metadata size ahead of time) and instead use `x-amz-decoded-content-length` for the size of the bytes that you want to write to S3, which for `string` or `Buffer` or `fs.createReadStream` is straightforward.
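So, as I understand the streaming docs, the request headers end up looking roughly like this (values are illustrative):

```js
const headers = {
  'Content-Encoding': 'aws-chunked',
  'Transfer-Encoding': 'chunked',                               // so no Content-Length is needed
  'x-amz-decoded-content-length': '5368709120',                 // bytes S3 should actually store
  'x-amz-content-sha256': 'STREAMING-AWS4-HMAC-SHA256-PAYLOAD', // signed, chunked payload
}
```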
Here's what AWS SDK uses for body length: https://github.com/smithy-lang/smithy-typescript/blob/main/packages/util-body-length-node/src/calculateBodyLength.ts
(Note that I think they're actually missing some cases there: you can pass `start` and/or `end` to `fs.createReadStream`, which default to `0` and `Infinity`, respectively.)
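Something like this would cover the `start`/`end` case, assuming the stream exposes those options as properties (Node does set them on the `ReadStream` instance, though they're not prominently documented):

```js
import { statSync } from 'node:fs'

// Hypothetical byte-length calculation for an fs.ReadStream, accounting for start/end
function readStreamByteLength (stream) {
  const { size } = statSync(stream.path)
  const start = stream.start ?? 0
  const end = Math.min(stream.end ?? Infinity, size - 1) // end is inclusive
  return Math.max(end - start + 1, 0)
}
```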
I don't think other `Readable`s with unknown length are likely to work in this manner, but seeing if the SDK handles them is something I wanna look into. If you needed to support that, I guess you could always consume the whole stream first... but I'm veering off-topic.
Oof. There are lots of corner cases with files, seems like. `fs.createReadStream` gets an FD, but... the results of `stat`/`fstat`/`lstat` can change if you append to the file after the read stream is created. I played with this some and saw some interesting things, but the bottom line is probably that files are hard.
Please forgive my compulsive sharing of trivia.
@keithlayne I think we can close this as of v0.21.0 and S3 v0.1.21, no?