feat: implement zero-dependency streaming multipart/form-data parser
This PR implements a multipart/form-data parser with a 1:1 match to busboy's API. It took about a week and many, many hours to complete. Docs and types are still needed, but I don't feel like adding them right now.
Tests from busboy are included, and every single test passes without any modifications!
- [ ] docs
- [ ] benchmarks
- [x] types
- [x] replace busboy in fetch
- [x] use ~~c8~~ nyc for coverage in busboy tests (busboy uses vanilla node for testing)
- [ ] ??
Bug(s):
- [x] ~~`path.basename` returns different values on Windows & Linux (`C:\\files\\1k_b.dat`)~~ fixed in 56052ec8fa47c566f8536b79c26c80eeda51268d
- [x] ~~reading headers is extremely slow~~ fixed in b7b9645e54df7ba5f42eeeca1626d28eadfcb385; parsing headers should now be O(n)
- [x] ~~in certain circumstances FileStream will not emit the 'end' event~~ fixed in dd02fe5930507d67f20acf9044aa7e6036c65df1
Codecov Report
Base: 90.43% // Head: 90.26% // Decreases project coverage by 0.17% :warning:

Coverage data is based on head (e21de08) compared to base (06f77a9). Patch coverage: 87.53% of modified lines in pull request are covered.
Additional details and impacted files
```diff
@@            Coverage Diff             @@
##             main    #1851      +/-  ##
==========================================
- Coverage   90.43%   90.26%    -0.17%
==========================================
  Files          71       76        +5
  Lines        6137     6463      +326
==========================================
+ Hits         5550     5834      +284
- Misses        587      629       +42
```
Impacted Files | Coverage Δ | |
---|---|---|
lib/formdata/util.js | 85.71% <85.71%> (ø) | |
lib/formdata/parser.js | 86.07% <86.07%> (ø) | |
lib/formdata/textsearch.js | 93.75% <93.75%> (ø) | |
index.js | 99.00% <100.00%> (+0.02%) | :arrow_up: |
lib/fetch/body.js | 97.15% <100.00%> (ø) | |
lib/fetch/dataURL.js | 88.46% <100.00%> (+0.38%) | :arrow_up: |
lib/formdata/constants.js | 100.00% <100.00%> (ø) | |
lib/formdata/filestream.js | 100.00% <100.00%> (ø) | |
lib/fetch/file.js | 89.65% <0.00%> (-1.15%) | :arrow_down: |
lib/fetch/index.js | 84.93% <0.00%> (+0.18%) | :arrow_up: |
@KhafraDev How does the perf compare?
Needs benchmarks, but it's probably slower in some areas and the same in others. Busboy uses streamsearch (which implements the Boyer-Moore-Horspool algorithm) to search for boundaries after bodies etc., while this implementation is much... lazier/easier. Rather than doing that, I join chunks together until one of them contains the boundary, and then split the chunks accordingly (see the sketch below). Of course, without benchmarks most of that is speculation about what's faster.
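A minimal sketch of that lazy approach (not the PR's actual parser code, and it ignores details like a boundary spanning two chunks):

```js
// Hedged sketch of the chunk-joining idea described above; `chunks` is an
// array of Buffers and `boundary` is the multipart boundary as a Buffer.
function splitOnBoundary (chunks, boundary) {
  let buffered = Buffer.alloc(0)
  const parts = []
  for (const chunk of chunks) {
    // join chunks until the accumulated buffer contains the boundary
    buffered = Buffer.concat([buffered, chunk])
    let index
    while ((index = buffered.indexOf(boundary)) !== -1) {
      parts.push(buffered.subarray(0, index)) // everything before the boundary
      buffered = buffered.subarray(index + boundary.length)
    }
  }
  return parts // trailing bytes after the final boundary are dropped here
}
```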
> Busboy uses streamsearch (which implements the Boyer-Moore-Horspool algorithm) to search for boundaries after bodies etc.
multipart/form-data is really old and historical; it isn't the best solution for parsing and sending files. Something better would be a more structured format that tells you how large each field is, so you could read exactly x bytes instead of scanning for a boundary.

I asked if we could at least add a `content-length` to each and every field (not just the total body size), which would make the boundary more or less obsolete, but the proposal went idle.

You know what would be awesome? Having cbor support baked into fetch! `await new Response(data).cbor()` would totally kick formdata, json, text and blobs out of existence by having more structured data support. It would more or less support the same things you can do with an object and `structuredClone`. (just wishful thinking)
> I asked if we could at least add content-length to each and every field.
I was actually wondering why there wasn't a content-length header for each field -- it's a really poor design overall. What makes websocket frame parsing so much easier is that you know the exact payload length!
I'm much more interested in replacing the asynchronous parsing with a synchronous parser. Not only is it actually spec compliant, but it will finally clean up the body consumers. It doesn't make sense to parse a multipart/form-data body asynchronously if it's already in memory. It's also many times slower than Deno/bun (https://twitter.com/jarredsumner/status/1625067963656335361).
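For concreteness, a hedged sketch of the synchronous shape being proposed (`parseMultipartSync` is a hypothetical stand-in, not an existing undici API):

```js
// The body consumer already has the whole payload in memory, so the
// parse itself can be a single synchronous pass over the bytes.
async function formData (response) {
  const bytes = Buffer.from(await response.arrayBuffer()) // full body, buffered
  const contentType = response.headers.get('content-type') ?? ''
  const boundary = contentType.split('boundary=')[1] // naive boundary extraction
  return parseMultipartSync(bytes, boundary) // hypothetical synchronous parser
}
```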
cc @jimmywarting I know you disagree with this, do you have any counterarguments before I replace busboy with the proposed alternative?
I don't think a synchronous multipart parser would work for us long term. Node.js just gained support for disk-backed Blobs. This means we can stream files to disk and then create Blobs that way and avoid memory leaks. However we can do that only if we have a streaming multipart parser.
This doesn't seem to be an issue for Deno/bun (or any of the browsers, although their use case is different). Since `request.body` is a ReadableStream, we could document formData's downsides and show an example of asynchronously parsing the body with busboy (or this async parser); a sketch follows below. The async parser makes this section of the code much harder to maintain, and many of the WPTs regarding the formdata body mixin are disabled :(.
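A hedged sketch of what that documented example might look like (the URL is hypothetical; it assumes the response body is multipart/form-data and a Node.js version with `Readable.fromWeb`):

```js
import busboy from 'busboy'
import { Readable } from 'node:stream'
import { fetch } from 'undici'

const response = await fetch('https://example.com/multipart')
const bb = busboy({
  headers: { 'content-type': response.headers.get('content-type') }
})

bb.on('field', (name, value) => console.log(`field ${name}: ${value}`))
bb.on('file', (name, stream, info) => {
  // stream the file to disk (or wherever) instead of buffering it in memory
  stream.resume()
})

await new Promise((resolve, reject) => {
  bb.on('close', resolve)
  bb.on('error', reject)
  // bridge the web ReadableStream to a Node stream and feed it to busboy
  Readable.fromWeb(response.body).pipe(bb)
})
```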
There is also a proposal to add a limit on the max body size a FormData can have (https://github.com/whatwg/fetch/issues/1592), and it's already an option on undici's Agent. Undici also has alternative APIs that are much better suited for large bodies. IIRC the main users of .formData are library authors who want cross-env compatibility - I wonder how many users are actively using it?
One more note: it's incredibly slow. https://twitter.com/jarredsumner/status/1625067963656335361
There's also the issue of how a multipart/form-data body should be parsed; there's no actual spec for it. The WPT coverage sucks as well, while it's really good for every other mixin.
Personally I'd like to follow the spec and the other platforms, which have all implemented the parsing synchronously. See #1694 for proof.
I don't think we could ever be in agreement on this one. Crashing with out-of-memory errors when receiving files is not safe. It's better not to ship a feature than to ship one that will crash everything: the number of vulnerability reports we would receive for this one would be massive.
Last time I checked, it seemed that Chrome used disk-based Blobs for at least some of those cases.
The difference with `.json()` and the others exists because we need all the data to produce a JS object, which is not the case for this one.
Node.js has just landed support for disk-backed Blobs, so we will soon be able to use this ;).
> I don't think a synchronous multipart parser would work for us long term. Node.js just gained support for disk-backed Blobs. This means we can stream files to disk and then create Blobs that way and avoid memory leaks. However we can do that only if we have a streaming multipart parser.
I agree that disk-backed blobs/files would be the best option and that async streaming is best for not hitting the RAM limit. I would rather have something that works and runs a bit slower than (as @mcollina puts it) "not ship a feature at all" due to the risk of out-of-memory issues.
It would also be technically possible to write the whole formdata payload to one single file first and then slice it virtually (by just changing the offset + size of where it should start/stop reading), e.g.:
```js
const blob = await fs.openAsBlob(path) // could include all formdata entries
const file1 = new File([blob.slice(100, 1024)], entry.name, { type: entry.mimeType })
```
So if you want to write all the data to disk first and then synchronously search the file for each boundary to find where each file entry begins/ends, that could also be an option.
Currently undici's URL-encoded formdata decoder is synchronous, b/c it does not include files, the payloads probably aren't as large as multipart ones, and it uses URLSearchParams to decode everything (a sketch of that path is below).

So I'm wondering: is https://twitter.com/jarredsumner/status/1625067963656335361 measuring URL-encoded payloads or multipart formdata-encoded payloads? Cuz it's two completely different paths to solving the issue at hand.
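For reference, that synchronous URL-encoded path amounts to something like this (a sketch, not undici's literal code):

```js
// decoding an application/x-www-form-urlencoded body is a cheap,
// fully synchronous operation once the text is in memory
const body = 'name=alice&age=30'
const params = new URLSearchParams(body)
const fd = new FormData()
for (const [name, value] of params) {
  fd.append(name, value)
}
```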
> we could document formData's downsides
If we are going to document the downsides of formdata, then we shouldn't provide an alternative showing how to do it yourself with busboy and streams; it should just work as it's intended to work. If we want to say that multipart/form-data is slower, then that's fine: recommend that people use URL-encoded payloads or JSON for faster parsing. multipart/form-data is intended for file uploads, so that is what it should be used for.
So if we should document anything at all, then it should be something like:

> If you intend to send files then by all means use `FormData()` to send the payload, but multipart/form-data parsing will be slower than URL-encoded payloads. So, if you intend to send just simple forms with only text, use `new URLSearchParams()` instead of `new FormData()` when you want to decode the payload with `await body.formData()`. If you intend to send really big files, upload just raw bytes using `{ body: blob }`, as that has better streaming compatibility. Choose the right encoding for the job at hand.
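To make that concrete, a hedged sketch of the three styles (endpoint URLs and file names are hypothetical; `fs.openAsBlob` needs a recent Node.js):

```js
import { fetch, FormData } from 'undici'
import { openAsBlob } from 'node:fs'

// simple text-only form: urlencoded, fastest to decode
await fetch('https://example.com/form', {
  method: 'POST',
  body: new URLSearchParams({ name: 'alice' })
})

// mixed fields and files: multipart/form-data, slower to decode
const fd = new FormData()
fd.append('avatar', await openAsBlob('./avatar.png'), 'avatar.png')
await fetch('https://example.com/upload', { method: 'POST', body: fd })

// one big file: raw bytes, best streaming behavior
await fetch('https://example.com/raw', {
  method: 'POST',
  body: await openAsBlob('./big.bin'),
  headers: { 'content-type': 'application/octet-stream' }
})
```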
I agree that we probably won't find middle ground for this 😄.
> it should just work as it's intended to work.
That's partially the issue - it doesn't work as intended, if by "intended" we're referring to the spec and/or user expectations. There are other, slightly less noticeable issues and pretty crappy workarounds we're already doing to match the spec: https://github.com/nodejs/undici/blob/06f77a92087f18151f1ed8c7eb25ad44351ba508/lib/fetch/body.js#L461-L462.
Anyways, this branch is mostly done: it passes all of busboy's tests and the same set of WPTs that are enabled. I'll finish up the docs eventually, and then I think we're good to go? Although I've been holding off because I know there will inevitably be issues that I won't have the motivation to fix...
Sticking with busboy is also problematic because it doesn't seem well maintained; there are a number of issues and pull requests left open. Plus, I've already spent many hours working on this lol
@KhafraDev What's the status on this? Are we using file-backed Blobs?
It's mostly done, I just need to spend a couple hours adding docs and cleaning stuff up. Is there a way to opt in to using the file-backed Blobs, or is it automatic?
> Is there a way to opt in to using the file-backed Blobs, or is it automatic?
Perhaps if `content-length` is lower than x bytes, then it would construct blobs in memory? Also, what if you were to use `response.blob()`? Would it be backed by the fs then?
EDIT: I tested `response.blob()` in Chrome and looked at ~/Library/Application Support/Google/Chrome/Default/blob_storage, and found that if the response was small it would just give you an in-memory blob, but if the response was over a certain size it would be offloaded to disk. So in the browser this pretty much happens automatically, although you would also need to clean up the temporarily stored file when there is no reference to it anymore...

I suppose the same logic could be used for FormData: if it encounters a file, start reading n bytes; if the file is larger than that, dump it to the file system and pipe the rest of the data to that file. See the sketch below.
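A hedged sketch of that threshold logic (`chunksToBlob` and the 1 MiB cutoff are made up for illustration; `fs.openAsBlob` needs Node.js 19.8+):

```js
import { createWriteStream, openAsBlob } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'
import { randomUUID } from 'node:crypto'

const THRESHOLD = 1024 * 1024 // 1 MiB; arbitrary cutoff

// `chunks` is an async iterable of Buffers for one file entry
async function chunksToBlob (chunks, type) {
  const buffered = []
  let size = 0
  let file = null
  let path = null
  for await (const chunk of chunks) {
    size += chunk.length
    if (file === null && size <= THRESHOLD) {
      buffered.push(chunk) // still small enough: keep it in memory
      continue
    }
    if (file === null) {
      // crossed the threshold: spill what we have to a temp file
      path = join(tmpdir(), randomUUID())
      file = createWriteStream(path)
      for (const b of buffered) file.write(b)
    }
    file.write(chunk) // ...and keep piping the rest to disk
  }
  if (file === null) return new Blob(buffered, { type })
  await new Promise((resolve) => file.end(resolve))
  // disk-backed Blob; the temp file still needs cleanup once unreferenced
  return openAsBlob(path, { type })
}
```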
@KhafraDev did you give up on this?
yes, the parsing should be done synchronously to match the spec.