feat: implement zero-dependency streaming multipart/form-data parser
This PR implements a multipart/form-data parser with a 1:1 match to busboy's API. It took about a week and many, many hours to complete. Docs and types are still needed, but I don't feel like adding them right now.
Tests from busboy are included, and every single test passes without any modifications!
- [ ] docs
- [ ] benchmarks
- [x] types
- [x] replace busboy in fetch
- [x] use ~~c8~~ nyc for coverage in busboy tests (busboy uses vanilla node for testing)
- [ ] ??
Bug(s):
- [x] ~~`path.basename` returns different values on Windows & Linux (`C:\\files\\1k_b.dat`)~~ fixed in 56052ec8fa47c566f8536b79c26c80eeda51268d
- [x] ~~reading headers is extremely slow~~ fixed in b7b9645e54df7ba5f42eeeca1626d28eadfcb385; parsing headers should now be O(n)
- [x] ~~in certain circumstances FileStream will not emit the 'end' event~~ fixed in dd02fe5930507d67f20acf9044aa7e6036c65df1
Codecov Report
Base: 90.43% // Head: 90.26% // Decreases project coverage by 0.17% :warning:

Coverage data is based on head (e21de08) compared to base (06f77a9). Patch coverage: 87.53% of modified lines in pull request are covered.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #1851      +/-  ##
==========================================
- Coverage   90.43%   90.26%    -0.17%
==========================================
  Files          71       76        +5
  Lines        6137     6463      +326
==========================================
+ Hits         5550     5834      +284
- Misses        587      629       +42
```
Impacted Files | Coverage Δ | |
---|---|---|
lib/formdata/util.js | 85.71% <85.71%> (ø) | |
lib/formdata/parser.js | 86.07% <86.07%> (ø) | |
lib/formdata/textsearch.js | 93.75% <93.75%> (ø) | |
index.js | 99.00% <100.00%> (+0.02%) | :arrow_up: |
lib/fetch/body.js | 97.15% <100.00%> (ø) | |
lib/fetch/dataURL.js | 88.46% <100.00%> (+0.38%) | :arrow_up: |
lib/formdata/constants.js | 100.00% <100.00%> (ø) | |
lib/formdata/filestream.js | 100.00% <100.00%> (ø) | |
lib/fetch/file.js | 89.65% <0.00%> (-1.15%) | :arrow_down: |
lib/fetch/index.js | 84.93% <0.00%> (+0.18%) | :arrow_up: |
@KhafraDev How does the perf compare?
Needs benchmarks, but it's probably slower in some areas and the same in others. Busboy uses streamsearch (which implements the Boyer-Moore-Horspool algorithm) to search for boundaries after bodies etc., while this implementation is much... lazier/easier. Rather than doing that, I join chunks together until one of them contains the boundary, and then split the chunks accordingly (see the sketch below). Of course, without benchmarks most of that is speculation about what's faster.
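A minimal sketch of that lazy approach (not the PR's actual parser code, and it ignores details like a boundary spanning two chunks):

```js
// Hedged sketch of the chunk-joining idea described above; `chunks` is an
// array of Buffers and `boundary` is the multipart boundary as a Buffer.
function splitOnBoundary (chunks, boundary) {
  let buffered = Buffer.alloc(0)
  const parts = []
  for (const chunk of chunks) {
    // join chunks until the accumulated buffer contains the boundary
    buffered = Buffer.concat([buffered, chunk])
    let index
    while ((index = buffered.indexOf(boundary)) !== -1) {
      parts.push(buffered.subarray(0, index)) // everything before the boundary
      buffered = buffered.subarray(index + boundary.length)
    }
  }
  return parts // trailing bytes after the final boundary are dropped here
}
```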
> Busboy uses streamsearch (which implements the Boyer-Moore-Horspool algorithm) to search for boundaries after bodies etc.
multipart/form-data is really old and historical; it isn't the best solution for parsing and sending files. Something better would be a more structured format that tells you how large each field is, so you could read exactly x bytes instead of scanning for a boundary.

I asked if we could at least add a `content-length` to each and every field (not just the total body size), which would make the boundary more or less obsolete, but the proposal went idle.

You know what would be awesome? Having cbor support baked into fetch! `await new Response(data).cbor()` would totally kick formdata, json, text and blobs out of existence by having more structured data support. It would more or less support the same things you can do with an object and `structuredClone`. (just wishful thinking)
> I asked if we could at least add content-length to each and every field.
I was actually wondering why there wasn't a content-length header for each field -- it's a really poor design overall. What makes websocket frame parsing so much easier is that you know the exact payload length!
I'm much more interested in replacing the asynchronous parsing with a synchronous parser. Not only is it actually spec compliant, but it will finally clean up the body consumers. It doesn't make sense to parse a multipart/form-data body asynchronously if it's already in memory. It's also many times slower than Deno/bun (https://twitter.com/jarredsumner/status/1625067963656335361).
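For concreteness, a hedged sketch of the synchronous shape being proposed (`parseMultipartSync` is a hypothetical stand-in, not an existing undici API):

```js
// The body consumer already has the whole payload in memory, so the
// parse itself can be a single synchronous pass over the bytes.
async function formData (response) {
  const bytes = Buffer.from(await response.arrayBuffer()) // full body, buffered
  const contentType = response.headers.get('content-type') ?? ''
  const boundary = contentType.split('boundary=')[1] // naive boundary extraction
  return parseMultipartSync(bytes, boundary) // hypothetical synchronous parser
}
```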
cc @jimmywarting I know you disagree with this, do you have any counterarguments before I replace busboy with the proposed alternative?
I don't think a synchronous multipart parser would work for us long term. Node.js just gained support for disk-backed Blobs. This means we can stream files to disk and then create Blobs that way and avoid memory leaks. However we can do that only if we have a streaming multipart parser.
This doesn't seem to be an issue for Deno/bun (or any of the browsers, although their use case is different). Since `request.body` is a ReadableStream, we could document formData's downsides and show an example of asynchronously parsing the body with busboy (or this async parser); a sketch follows below. The async parser makes this section of the code much harder to maintain, and many of the WPTs regarding the formdata body mixin are disabled :(.
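A hedged sketch of what that documented example might look like (the URL is hypothetical; it assumes the response body is multipart/form-data and a Node.js version with `Readable.fromWeb`):

```js
import busboy from 'busboy'
import { Readable } from 'node:stream'
import { fetch } from 'undici'

const response = await fetch('https://example.com/multipart')
const bb = busboy({
  headers: { 'content-type': response.headers.get('content-type') }
})

bb.on('field', (name, value) => console.log(`field ${name}: ${value}`))
bb.on('file', (name, stream, info) => {
  // stream the file to disk (or wherever) instead of buffering it in memory
  stream.resume()
})

await new Promise((resolve, reject) => {
  bb.on('close', resolve)
  bb.on('error', reject)
  // bridge the web ReadableStream to a Node stream and feed it to busboy
  Readable.fromWeb(response.body).pipe(bb)
})
```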
There is also a proposal to add a limit on the max body size a FormData can have (https://github.com/whatwg/fetch/issues/1592), and it's already an option on undici's Agent. Undici also has alternative APIs that are much better suited for large bodies. IIRC the main users of .formData are library authors who want cross-env compatibility - I wonder how many users are actively using it?
One more note: it's incredibly slow. https://twitter.com/jarredsumner/status/1625067963656335361
There's also the issue of how a multipart/form-data body should be parsed; there's no actual spec for it. The WPT coverage sucks as well, while it's really good for every other mixin.
Personally I'd like to follow the spec and the other platforms, which have all implemented the parsing synchronously. See #1694 for proof.
I don't think we could ever be in agreement on this one. Crashing with out-of-memory errors when receiving files is not safe. It's better not to ship a feature than to ship one that will crash everything: the number of vulnerability reports we would receive for this one would be massive.
Last time I checked, it seemed that Chrome used disk-based Blobs for at least some of those cases.
The difference with `.json()` and the others exists because we need all the data to produce a JS object, which is not the case for this one.
Node.js has just landed support for disk-backed Blobs, so we will soon be able to use this ;).
> I don't think a synchronous multipart parser would work for us long term. Node.js just gained support for disk-backed Blobs. This means we can stream files to disk and then create Blobs that way and avoid memory leaks. However we can do that only if we have a streaming multipart parser.
I agree that disk-backed blobs/files would be the best option and that async streaming is best for not hitting the RAM limit. I would rather have something that works and runs a bit slower than (as @mcollina puts it) "not ship a feature at all" due to the risk of out-of-memory issues.
It would also be technically possible to write the whole formdata payload to one single file first and then slice it virtually (by just changing the offset + size of where it should start/stop reading), e.g.:
```js
const blob = await fs.openAsBlob(path) // could include all formdata entries
const file1 = new File([blob.slice(100, 1024)], entry.name, { type: entry.mimeType })
```
So if you want to write all the data to disk first and then synchronously search the file for each boundary to find where each file entry begins/ends, that could also be an option.
Currently undici's URL-encoded formdata decoder is synchronous, b/c it does not include files, the payloads probably aren't as large as multipart ones, and it uses URLSearchParams to decode everything (a sketch of that path is below).

So I'm wondering: is https://twitter.com/jarredsumner/status/1625067963656335361 measuring URL-encoded payloads or multipart formdata-encoded payloads? Cuz it's two completely different paths to solving the issue at hand.
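For reference, that synchronous URL-encoded path amounts to something like this (a sketch, not undici's literal code):

```js
// decoding an application/x-www-form-urlencoded body is a cheap,
// fully synchronous operation once the text is in memory
const body = 'name=alice&age=30'
const params = new URLSearchParams(body)
const fd = new FormData()
for (const [name, value] of params) {
  fd.append(name, value)
}
```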
> we could document formData's downsides
If we are going to document the downsides of formdata, then we shouldn't provide an alternative showing how to do it yourself with busboy and streams; it should just work as it's intended to work. If we want to say that multipart/form-data is slower, then that's fine: recommend that people use URL-encoded payloads or JSON for faster parsing. multipart/form-data is intended for file uploads, so that is what it should be used for.
So if we should document anything at all, then it should be something like:

> If you intend to send files then by all means use `FormData()` to send the payload, but multipart/form-data parsing will be slower than URL-encoded payloads. So, if you intend to send just simple forms with only text, use `new URLSearchParams()` instead of `new FormData()` when you want to decode the payload with `await body.formData()`. If you intend to send really big files, upload just raw bytes using `{ body: blob }`, as that has better streaming compatibility. Choose the right encoding for the job at hand.
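To make that concrete, a hedged sketch of the three styles (endpoint URLs and file names are hypothetical; `fs.openAsBlob` needs a recent Node.js):

```js
import { fetch, FormData } from 'undici'
import { openAsBlob } from 'node:fs'

// simple text-only form: urlencoded, fastest to decode
await fetch('https://example.com/form', {
  method: 'POST',
  body: new URLSearchParams({ name: 'alice' })
})

// mixed fields and files: multipart/form-data, slower to decode
const fd = new FormData()
fd.append('avatar', await openAsBlob('./avatar.png'), 'avatar.png')
await fetch('https://example.com/upload', { method: 'POST', body: fd })

// one big file: raw bytes, best streaming behavior
await fetch('https://example.com/raw', {
  method: 'POST',
  body: await openAsBlob('./big.bin'),
  headers: { 'content-type': 'application/octet-stream' }
})
```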
I agree that we probably won't find middle ground for this 😄.
> it should just work as it's intended to work.
That's partially the issue - it doesn't work as intended, if by "intended" we're referring to the spec and/or user expectations. There are other, slightly less noticeable issues and pretty crappy workarounds we're already doing to match the spec: https://github.com/nodejs/undici/blob/06f77a92087f18151f1ed8c7eb25ad44351ba508/lib/fetch/body.js#L461-L462.
Anyways, this branch is mostly done: it passes all of busboy's tests and the same set of WPTs that are enabled. I'll finish up the docs eventually, and then I think we're good to go? Although I've been holding off because I know there will inevitably be issues that I won't have the motivation to fix...
Sticking with busboy is also problematic because it doesn't seem well maintained; there are a number of issues and pull requests left open. Plus, I've already spent many hours working on this lol
@KhafraDev What's the status on this? Are we using file-backed Blobs?
It's mostly done, I just need to spend a couple hours adding docs and cleaning stuff up. Is there a way to opt in to using the file-backed Blobs, or is it automatic?
> Is there a way to opt in to using the file-backed Blobs, or is it automatic?
Perhaps if `content-length` is lower than x bytes, then it would construct blobs in memory? Also, what if you were to use `response.blob()`? Would it be backed by the fs then?
EDIT: I tested `response.blob()` in Chrome and looked at ~/Library/Application Support/Google/Chrome/Default/blob_storage, and found that if the response was small it would just give you an in-memory blob, but if the response was over a certain size it would be offloaded to disk. So in the browser this pretty much happens automatically, although you would also need to clean up the temporarily stored file when there is no reference to it anymore...

I suppose the same logic could be used for FormData: if it encounters a file, start reading n bytes; if the file is larger than that, dump it to the file system and pipe the rest of the data to that file. See the sketch below.
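A hedged sketch of that threshold logic (`chunksToBlob` and the 1 MiB cutoff are made up for illustration; `fs.openAsBlob` needs Node.js 19.8+):

```js
import { createWriteStream, openAsBlob } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'
import { randomUUID } from 'node:crypto'

const THRESHOLD = 1024 * 1024 // 1 MiB; arbitrary cutoff

// `chunks` is an async iterable of Buffers for one file entry
async function chunksToBlob (chunks, type) {
  const buffered = []
  let size = 0
  let file = null
  let path = null
  for await (const chunk of chunks) {
    size += chunk.length
    if (file === null && size <= THRESHOLD) {
      buffered.push(chunk) // still small enough: keep it in memory
      continue
    }
    if (file === null) {
      // crossed the threshold: spill what we have to a temp file
      path = join(tmpdir(), randomUUID())
      file = createWriteStream(path)
      for (const b of buffered) file.write(b)
    }
    file.write(chunk) // ...and keep piping the rest to disk
  }
  if (file === null) return new Blob(buffered, { type })
  await new Promise((resolve) => file.end(resolve))
  // disk-backed Blob; the temp file still needs cleanup once unreferenced
  return openAsBlob(path, { type })
}
```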
@KhafraDev did you give up on this?
yes, the parsing should be done synchronously to match the spec.