valibot
Improve performance of maxBytes()
I'd like to use maxBytes() to help prevent certain types of abuse where large amounts of data are sent to clients in order to overload them. This makes it fairly important that maxBytes() is as fast as possible and does not itself become a bottleneck if someone attempts to send hundreds of megabytes of data.
Before:
Since the current implementation always reads all of the bytes of the string, there is no significant difference between the size limits being enforced:
| input/limit | any size |
|---|---|
| 1B | 1,894,455.1 |
| 5MB | 1,507.1 |
| 50MB | 150.1 |
| 500MB | 13.4 |
| max | 12.5 |
All values are operations per second, tested on a 2021 Apple M1 Pro MacBook Pro.
After: (ops/sec)
Caching the Uint8Array speeds up 1B inputs by about 30x, and because this new implementation stops reading/writing after `maxBytes + 4` bytes, performance depends more on the requirement than on the input itself.
Note: It needs to read past the requirement to test if the string is too long. The `+4` bytes is because the next character in the string could be anywhere from 1 to 4 bytes, and `encodeInto()` will drop it if there's not enough room in the array.
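To illustrate that drop behavior (a small sketch, not from the PR itself):

```typescript
// '𝄢' (a surrogate pair) needs 4 UTF-8 bytes. With only 3 bytes of
// room, encodeInto() drops it entirely rather than writing a partial
// character, so nothing is read and nothing is written.
const encoder = new TextEncoder();
const { read, written } = encoder.encodeInto("𝄢", new Uint8Array(3));
// read === 0 and written === 0
```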
| input/limit | <=5MB | <=50MB | <=500MB |
|---|---|---|---|
| 1B | 30,240,582.3 | 31,011,981.5 | 30,389,671.3 |
| 5MB | 6,611.1 | 7,095.6 | 7,071.8 |
| 50MB | 6,736.9 | 551.4 | 498.7 |
| 500MB | 6,685.8 | 547.4 | 51.6 |
| max | 6,318.7 | 538.0 | 51.5 |
All values are operations per second, tested on a 2021 Apple M1 Pro MacBook Pro.
Thanks for the catch! Why would you prefer to use maxBytes instead of maxLength?
It just tends to be how we specify things. JavaScript strings are UTF-16 encoded, but many languages default to UTF-8 encoded (Rust, Swift, Golang, Ruby, etc). If you don't check these in an encoding-aware way, you end up with strings that are allowed to be different lengths depending on the language you are checking them in.
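For example (illustrative, not from the PR), the same string measures differently depending on which units you count:

```typescript
// A single user-visible character can be 2 UTF-16 code units
// but 4 UTF-8 bytes:
const input = "𝄢"; // MUSICAL SYMBOL F CLEF
const utf16Units = input.length;                          // 2 UTF-16 code units
const utf8Bytes = new TextEncoder().encode(input).length; // 4 UTF-8 bytes
```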
Aside: I've also opened an issue in the whatwg/encoding standard to suggest a faster method for this https://github.com/whatwg/encoding/issues/333
You can actually make this even faster by avoiding calculating the specific number of bytes when possible:
if (dataset.value.length > maxBytes) {
// The minimum possible number of bytes is already too long
}
if (dataset.value.length * 3 <= maxBytes) {
// The maximum possible number of bytes is already small enough
}
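Putting those two shortcuts together, a fast-path check might look like this (a sketch; the name `byteLengthShortcut` is hypothetical):

```typescript
// Returns true/false when the length bounds alone decide the question,
// or undefined when the string must actually be encoded to know.
function byteLengthShortcut(value: string, maxBytes: number): boolean | undefined {
  // Each UTF-16 code unit encodes to at least 1 UTF-8 byte,
  // so the minimum possible byte length is value.length.
  if (value.length > maxBytes) return false;
  // Each UTF-16 code unit encodes to at most 3 UTF-8 bytes,
  // so the maximum possible byte length is value.length * 3.
  if (value.length * 3 <= maxBytes) return true;
  // Ambiguous: only encoding can tell.
  return undefined;
}
```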
> You can actually make this even faster by avoiding calculating the specific number of bytes when possible:
Can you provide more details? When does this work? In general, how should I proceed with this PR? What is your recommendation?
Sure, let me give you some context:
Encoding is expensive
Strings in JavaScript are sequences of UTF-16 code units.
- In UTF-16, each character is 1 or 2 code units (2 or 4 bytes)
- In UTF-8, each character is 1 to 4 code units (1 to 4 bytes)
Converting a sequence of UTF-16 code units to UTF-8 code units is a relatively expensive operation that involves a fair bit of math, but right now it's the only way to calculate the UTF-8 byte length of a string in browsers.
It's much faster to calculate just the number of bytes in a string because you can skip the work to convert them into their specific values and just match UTF-16 ranges to byte values. This is the primary reason why functions like Buffer.byteLength(input) are much faster than new TextEncoder().encode(input) (besides not needing to allocate more memory).
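For instance, in Node.js (outside the browser), `Buffer.byteLength` gives the same answer without materializing the encoded bytes. A quick sketch:

```typescript
import { Buffer } from "node:buffer";

// '¢' is 2 UTF-8 bytes, so the string is 5 bytes in total.
const s = "¢100";
const viaBuffer = Buffer.byteLength(s, "utf8");
const viaEncoder = new TextEncoder().encode(s).length;
// Both report 5, but Buffer.byteLength skips producing the bytes.
```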
But since that's not an option on the web, you could at least avoid encoding the entire string, and only check if it's too long, which is what this PR does with encodeInto()
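A minimal sketch of that early-exit behavior with `encodeInto()`:

```typescript
// encodeInto() fills the destination and then stops, reporting how many
// UTF-16 code units it read and how many bytes it wrote. With a longer
// input, it never encodes the rest of the string.
const enc = new TextEncoder();
const dest = new Uint8Array(4);
const result = enc.encodeInto("hello world", dest);
// result.read === 4 and result.written === 4: only "hell" was encoded
```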
Skipping encoding when possible
You can optimize this even further by avoiding encoding at all with a little bit of knowledge about how UTF-16 code units get converted to UTF-8 bytes.
| Char | UTF-16 | UTF-8 | UTF-16 units → UTF-8 bytes |
|---|---|---|---|
| 'a' | 0061 (1 unit → 2 bytes) | 41 (1 unit → 1 byte) | 1 → 1 (min) |
| '¢' | 00A2 (1 unit → 2 bytes) | C2 A2 (2 units → 2 bytes) | 1 → 2 |
| 'ก' | 0E01 (1 unit → 2 bytes) | E0 B8 81 (3 units → 3 bytes) | 1 → 3 (max) |
| '𝄢' | D834 DD22 (2 units → 4 bytes) | F0 9D 84 A2 (4 units → 4 bytes) | 2 → 4 |
Each UTF-16 code unit converts to between 1 and 3 UTF-8 bytes.
So, without having to encode anything, we can know the minimum and maximum possible UTF-8 byte length of any JavaScript string just by doing:
let MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT = 1
let MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT = 3
let min = string.length * MIN_UTF8_BYTES_PER_UTF16_CODE_UNIT
let max = string.length * MAX_UTF8_BYTES_PER_UTF16_CODE_UNIT
Not needing to encode anything will speed up the vast majority of uses of maxBytes().
Optimized Solution
This is a slightly updated version of the current PR, which is so far the fastest option for asserting that a string is under a certain UTF-8 byte length:
let encoder: TextEncoder
function maxBytes(bytes: number) {
let array: Uint8Array
return function check(input: string): boolean {
// Minimum possible byte length (1 byte per UTF-16 unit) is already too long
if (input.length > bytes) return false
// Maximum possible byte length (3 bytes per UTF-16 unit) already fits
if (input.length * 3 <= bytes) return true
encoder ??= new TextEncoder()
// +4 spare bytes so a trailing (up to 4-byte) character is not dropped
array ??= new Uint8Array(bytes + 4)
return encoder.encodeInto(input, array).written <= bytes
}
}
Problem
The only problem is that this doesn't give you a received value of a specific byte length. The only reason that this is fast is because it gives up computing the exact length of the string once it's past the limit.
If you're okay with dropping the received value from the issue that this adds, you could use this more optimized version.
Thanks for the details! I now understand the problem and the possible solutions much better. I am not sure what to do. We use the expected and received pattern everywhere, so I am not sure if we should make an exception here. On the other hand, I of course see the downside that this could be abused by sending extremely long strings to the server.
Some options:
- Drop `received`
- Make `received` an estimate: `${string.length} to ${string.length * 3}`
- Make `received` an estimate only at larger sizes
- Make `received` the UTF-16 bytes: `${string.length * 2} UTF-16 bytes`
- Add an option to disable `received`
I think at the moment I prefer to wait until more developers encounter this problem to get more feedback on how to proceed. In the meantime, as a workaround, you could implement a fastMaxBytes action yourself for use in your own schemas.
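For reference, such a user-defined check could look roughly like this (a sketch based on the discussion above; the name fastMaxBytes and the exact shape are assumptions, and it deliberately skips computing a received value):

```typescript
let encoder: TextEncoder | undefined;

function fastMaxBytes(maxBytes: number): (input: string) => boolean {
  let array: Uint8Array | undefined;
  return (input) => {
    // Minimum possible UTF-8 length (1 byte per UTF-16 unit) is already too long
    if (input.length > maxBytes) return false;
    // Maximum possible UTF-8 length (3 bytes per UTF-16 unit) already fits
    if (input.length * 3 <= maxBytes) return true;
    encoder ??= new TextEncoder();
    // +4 spare bytes so a trailing up-to-4-byte character is not dropped
    array ??= new Uint8Array(maxBytes + 4);
    return encoder.encodeInto(input, array).written <= maxBytes;
  };
}
```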
I will close this PR for now, but if there is interest in adding a fastMaxBytes action in the long run, feel free to leave a comment.