valibot
Improve performance of maxBytes()
I'd like to use maxBytes() to help prevent certain types of abuse where large amounts of data are sent to clients in order to overload them. This makes it fairly important that maxBytes() is as fast as possible and does not itself become a bottleneck if someone attempts to send hundreds of megabytes of data.
Before:
Since the current implementation always reads all of the bytes of the string, there is no significant difference between the size limits being enforced:
| input/limit | any size |
|---|---|
| 1B | 1,894,455.1 |
| 5MB | 1,507.1 |
| 50MB | 150.1 |
| 500MB | 13.4 |
| max | 12.5 |
All values are operations per second, tested on a 2021 Apple M1 Pro MacBook Pro.
After: (ops/sec)
Caching the Uint8Array speeds up 1B inputs by about 30x, and because this new implementation stops reading/writing after `maxBytes + 4` bytes, performance depends more on the requirement than on the input itself.
Note: It needs to read past the requirement to test if the string is too long. The `+4` bytes is because the next character in the string could be anywhere from 1 to 4 bytes, and `encodeInto()` will drop it if there's not enough room in the array.
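To illustrate that drop behavior (a small sketch, not from the PR itself):

```typescript
// '𝄢' (a surrogate pair) needs 4 UTF-8 bytes. With only 3 bytes of
// room, encodeInto() drops it entirely rather than writing a partial
// character, so nothing is read and nothing is written.
const encoder = new TextEncoder();
const { read, written } = encoder.encodeInto("𝄢", new Uint8Array(3));
// read === 0 and written === 0
```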
| input/limit | <=5MB | <=50MB | <=500MB |
|---|---|---|---|
| 1B | 30,240,582.3 | 31,011,981.5 | 30,389,671.3 |
| 5MB | 6,611.1 | 7,095.6 | 7,071.8 |
| 50MB | 6,736.9 | 551.4 | 498.7 |
| 500MB | 6,685.8 | 547.4 | 51.6 |
| max | 6,318.7 | 538.0 | 51.5 |
All values are operations per second, tested on a 2021 Apple M1 Pro MacBook Pro.
Thanks for the catch! Why would you prefer to use maxBytes instead of maxLength?
It just tends to be how we specify things. JavaScript strings are UTF-16 encoded, but many languages default to UTF-8 encoded (Rust, Swift, Golang, Ruby, etc). If you don't check these in an encoding-aware way, you end up with strings that are allowed to be different lengths depending on the language you are checking them in.
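For example (illustrative, not from the PR), the same string measures differently depending on which units you count:

```typescript
// A single user-visible character can be 2 UTF-16 code units
// but 4 UTF-8 bytes:
const input = "𝄢"; // MUSICAL SYMBOL F CLEF
const utf16Units = input.length;                          // 2 UTF-16 code units
const utf8Bytes = new TextEncoder().encode(input).length; // 4 UTF-8 bytes
```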
Aside: I've also opened an issue in the whatwg/encoding standard to suggest a faster method for this https://github.com/whatwg/encoding/issues/333
You can actually make this even faster by avoiding calculating the specific number of bytes when possible:
if (dataset.value.length > maxBytes) {
// The minimum possible number of bytes is already too long
}
if (dataset.value.length * 3 <= maxBytes) {
// The maximum possible number of bytes is already small enough
}
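Putting those two shortcuts together, a fast-path check might look like this (a sketch; the name `byteLengthShortcut` is hypothetical):

```typescript
// Returns true/false when the length bounds alone decide the question,
// or undefined when the string must actually be encoded to know.
function byteLengthShortcut(value: string, maxBytes: number): boolean | undefined {
  // Each UTF-16 code unit encodes to at least 1 UTF-8 byte,
  // so the minimum possible byte length is value.length.
  if (value.length > maxBytes) return false;
  // Each UTF-16 code unit encodes to at most 3 UTF-8 bytes,
  // so the maximum possible byte length is value.length * 3.
  if (value.length * 3 <= maxBytes) return true;
  // Ambiguous: only encoding can tell.
  return undefined;
}
```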
> You can actually make this even faster by avoiding calculating the specific number of bytes when possible:
Can you provide more details? When does this work? In general, how should I proceed with this PR? What is your recommendation?
Sure, let me give you some context:
Encoding is expensive
Strings in JavaScript are sequences of UTF-16 code units.
- In UTF-16, each character is 1 or 2 code units (2 or 4 bytes)
- In UTF-8, each character is 1 to 4 code units (1 to 4 bytes)
Converting a sequence of UTF-16 code units to UTF-8 code units is a relatively expensive operation that involves a fair bit of math, but right now it's the only way to calculate the UTF-8 byte length of a string in browsers.
It's much faster to calculate just the number of bytes in a string because you can skip the work to convert them into their specific values and just match UTF-16 ranges to byte values. This is the primary reason why functions like Buffer.byteLength(input) are much faster than new TextEncoder().encode(input) (besides not needing to allocate more memory).
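For instance, in Node.js (outside the browser), `Buffer.byteLength` gives the same answer without materializing the encoded bytes. A quick sketch:

```typescript
import { Buffer } from "node:buffer";

// '¢' is 2 UTF-8 bytes, so the string is 5 bytes in total.
const s = "¢100";
const viaBuffer = Buffer.byteLength(s, "utf8");
const viaEncoder = new TextEncoder().encode(s).length;
// Both report 5, but Buffer.byteLength skips producing the bytes.
```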
But since that's not an option on the web, you could at least avoid encoding the entire string, and only check if it's too long, which is what this PR does with encodeInto()
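A minimal sketch of that early-exit behavior with `encodeInto()`:

```typescript
// encodeInto() fills the destination and then stops, reporting how many
// UTF-16 code units it read and how many bytes it wrote. With a longer
// input, it never encodes the rest of the string.
const enc = new TextEncoder();
const dest = new Uint8Array(4);
const result = enc.encodeInto("hello world", dest);
// result.read === 4 and result.written === 4: only "hell" was encoded
```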
Skipping encoding when possible
You can optimize this even further by avoiding encoding at all with a little bit of knowledge about how UTF-16 code units get converted to UTF-8 bytes.
| Char | UTF-16 | UTF-8 | UTF-16 units → UTF-8 bytes |
|---|---|---|---|
| 'a' | 0061 (1 unit → 2 bytes) | 41 (1 unit → 1 byte) | 1 → 1 (min) |
| '¢' | 00A2 (1 unit → 2 bytes) | C2 A2 (2 units → 2 bytes) | 1 → 2 |
| 'ก' | 0E01 (1 unit → 2 bytes) | E0 B8 81 (3 units → 3 bytes) | 1 → 3 (max) |
| '𝄢' | D834 DD22 (2 units → 4 bytes) | F0 9D 84 A2 (4 units → 4 bytes) | 2 → 4 |
Each UTF-16 code unit converts to between 1 and 3 UTF-8 bytes.
So, without having to encode anything, we can know the minimum and maximum possible UTF-8 byte length of any JavaScript string just by doing:
let MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT = 1
let MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT = 3
let min = string.length * MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT
let max = string.length * MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT
Not needing to encode anything will speed up the vast majority of uses of maxBytes().
Optimized Solution
This is a slightly updated version of the current PR, which is so far the fastest option for asserting that a string is under a certain UTF-8 byte length:
let encoder: TextEncoder
function maxBytes(bytes: number) {
let array: Uint8Array
return function check(input: string): boolean {
// Minimum possible byte length (1 byte per UTF-16 unit) is already too long
if (input.length > bytes) return false
// Maximum possible byte length (3 bytes per UTF-16 unit) already fits
if (input.length * 3 <= bytes) return true
encoder ??= new TextEncoder()
// +4 spare bytes so a trailing (up to 4-byte) character is not dropped
array ??= new Uint8Array(bytes + 4)
return encoder.encodeInto(input, array).written <= bytes
}
}
Problem
The only problem is that this doesn't give you a received value of a specific byte length. The only reason that this is fast is because it gives up computing the exact length of the string once it's past the limit.
If you're okay with dropping the received value from the issue that this adds, you could use this more optimized version.
Thanks for the details! I now understand the problem and the possible solutions much better. I am not sure what to do. We use the expected and received pattern everywhere, so I am not sure if we should make an exception here. On the other hand, I of course see the downside that this could be abused by sending extremely long strings to the server.
Some options:
- Drop `received`
- Make `received` an estimate: `${string.length} to ${string.length * 3}`
- Make `received` an estimate only at larger sizes
- Make `received` the UTF-16 bytes: `${string.length * 2} UTF-16 bytes`
- Add an option to disable `received`
I think at the moment I prefer to wait until more developers encounter this problem to get more feedback on how to proceed. In the meantime, as a workaround, you could implement a fastMaxBytes action yourself for use in your own schemas.
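For reference, such a user-defined check could look roughly like this (a sketch based on the discussion above; the name fastMaxBytes and the exact shape are assumptions, and it deliberately skips computing a received value):

```typescript
let encoder: TextEncoder | undefined;

function fastMaxBytes(maxBytes: number): (input: string) => boolean {
  let array: Uint8Array | undefined;
  return (input) => {
    // Minimum possible UTF-8 length (1 byte per UTF-16 unit) is already too long
    if (input.length > maxBytes) return false;
    // Maximum possible UTF-8 length (3 bytes per UTF-16 unit) already fits
    if (input.length * 3 <= maxBytes) return true;
    encoder ??= new TextEncoder();
    // +4 spare bytes so a trailing up-to-4-byte character is not dropped
    array ??= new Uint8Array(maxBytes + 4);
    return encoder.encodeInto(input, array).written <= maxBytes;
  };
}
```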
I will close this PR for now, but if there is interest in adding a fastMaxBytes action in the long run, feel free to leave a comment.