utf-8-validate
Use the simdutf library
Here are some benchmarks using the uv benchmark suite:
$ npx envinfo --system
System:
OS: macOS 11.5
CPU: (16) x64 Intel(R) Xeon(R) W-2140B CPU @ 3.20GHz
Memory: 19.39 GB / 32.00 GB
Shell: 5.1.8 - /usr/local/bin/bash
$ node bench.js
Loading https://en.wikipedia.org/wiki/Main_Page ...
uv x 17,911 ops/sec ±0.08% (93 runs sampled)
utf-8-validate (5.0.5, C++) x 110,868 ops/sec ±0.09% (96 runs sampled)
utf-8-validate (simdutf, C++) x 698,016 ops/sec ±0.16% (95 runs sampled)
utf-8-validate (5.0.5, JS) x 10,086 ops/sec ±0.08% (99 runs sampled)
isutf8 x 12,411 ops/sec ±0.43% (97 runs sampled)
------------------------------------------------------------
Loading https://ro.wikipedia.org/wiki/Pagina_principală ...
uv x 7,570 ops/sec ±0.51% (95 runs sampled)
utf-8-validate (5.0.5, C++) x 25,982 ops/sec ±0.46% (95 runs sampled)
utf-8-validate (simdutf, C++) x 160,639 ops/sec ±0.40% (93 runs sampled)
utf-8-validate (5.0.5, JS) x 5,091 ops/sec ±0.32% (96 runs sampled)
isutf8 x 6,293 ops/sec ±0.11% (100 runs sampled)
------------------------------------------------------------
Loading https://ru.wikipedia.org/wiki/Заглавная_страница ...
uv x 7,467 ops/sec ±0.28% (99 runs sampled)
utf-8-validate (5.0.5, C++) x 16,716 ops/sec ±0.41% (93 runs sampled)
utf-8-validate (simdutf, C++) x 193,080 ops/sec ±0.19% (92 runs sampled)
utf-8-validate (5.0.5, JS) x 5,011 ops/sec ±0.08% (98 runs sampled)
isutf8 x 6,298 ops/sec ±0.25% (99 runs sampled)
------------------------------------------------------------
Loading https://ar.wikipedia.org/wiki/الصفحة_الرئيسية ...
uv x 5,784 ops/sec ±0.07% (97 runs sampled)
utf-8-validate (5.0.5, C++) x 12,480 ops/sec ±0.09% (98 runs sampled)
utf-8-validate (simdutf, C++) x 153,133 ops/sec ±0.15% (97 runs sampled)
utf-8-validate (5.0.5, JS) x 4,113 ops/sec ±0.06% (98 runs sampled)
isutf8 x 5,148 ops/sec ±0.09% (98 runs sampled)
------------------------------------------------------------
Loading https://ja.wikipedia.org/wiki/メインページ ...
uv x 10,007 ops/sec ±0.08% (97 runs sampled)
utf-8-validate (5.0.5, C++) x 23,876 ops/sec ±0.09% (94 runs sampled)
utf-8-validate (simdutf, C++) x 225,834 ops/sec ±0.15% (98 runs sampled)
utf-8-validate (5.0.5, JS) x 6,908 ops/sec ±0.09% (98 runs sampled)
isutf8 x 6,832 ops/sec ±0.09% (99 runs sampled)
------------------------------------------------------------
Loading https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt ...
uv x 48,083 ops/sec ±0.08% (98 runs sampled)
utf-8-validate (5.0.5, C++) x 67,217 ops/sec ±0.11% (94 runs sampled)
utf-8-validate (simdutf, C++) x 883,553 ops/sec ±0.11% (100 runs sampled)
utf-8-validate (5.0.5, JS) x 39,180 ops/sec ±0.08% (98 runs sampled)
isutf8 x 39,953 ops/sec ±0.09% (97 runs sampled)
------------------------------------------------------------
Preparing 256B of random ASCII data
uv x 5,368,671 ops/sec ±0.14% (98 runs sampled)
utf-8-validate (5.0.5, C++) x 8,219,664 ops/sec ±0.08% (98 runs sampled)
utf-8-validate (simdutf, C++) x 4,609,830 ops/sec ±0.48% (93 runs sampled)
utf-8-validate (5.0.5, JS) x 3,033,199 ops/sec ±0.10% (95 runs sampled)
isutf8 x 3,000,818 ops/sec ±0.08% (99 runs sampled)
------------------------------------------------------------
Preparing 1KB of random ASCII data
uv x 1,391,086 ops/sec ±0.07% (97 runs sampled)
utf-8-validate (5.0.5, C++) x 5,043,617 ops/sec ±0.07% (94 runs sampled)
utf-8-validate (simdutf, C++) x 4,458,192 ops/sec ±0.42% (93 runs sampled)
utf-8-validate (5.0.5, JS) x 777,434 ops/sec ±0.08% (94 runs sampled)
isutf8 x 773,459 ops/sec ±0.08% (98 runs sampled)
------------------------------------------------------------
Preparing 64KB of random ASCII data
uv x 22,809 ops/sec ±0.07% (98 runs sampled)
utf-8-validate (5.0.5, C++) x 162,204 ops/sec ±0.07% (99 runs sampled)
utf-8-validate (simdutf, C++) x 1,193,138 ops/sec ±0.16% (98 runs sampled)
utf-8-validate (5.0.5, JS) x 12,569 ops/sec ±0.06% (100 runs sampled)
isutf8 x 12,549 ops/sec ±0.10% (99 runs sampled)
------------------------------------------------------------
Preparing 1MB of random ASCII data
uv x 1,428 ops/sec ±0.06% (98 runs sampled)
utf-8-validate (5.0.5, C++) x 10,423 ops/sec ±0.06% (98 runs sampled)
utf-8-validate (simdutf, C++) x 79,672 ops/sec ±0.62% (87 runs sampled)
utf-8-validate (5.0.5, JS) x 785 ops/sec ±0.08% (97 runs sampled)
isutf8 x 784 ops/sec ±0.09% (97 runs sampled)
------------------------------------------------------------
Preparing 4MB of random ASCII bytes
uv x 357 ops/sec ±0.07% (91 runs sampled)
utf-8-validate (5.0.5, C++) x 2,606 ops/sec ±0.13% (98 runs sampled)
utf-8-validate (simdutf, C++) x 8,026 ops/sec ±2.95% (82 runs sampled)
utf-8-validate (5.0.5, JS) x 196 ops/sec ±0.09% (90 runs sampled)
isutf8 x 196 ops/sec ±0.09% (90 runs sampled)
------------------------------------------------------------
Preparing all valid UTF-8 bytes ~4.17 MB
uv x 221 ops/sec ±0.07% (86 runs sampled)
utf-8-validate (5.0.5, C++) x 327 ops/sec ±0.16% (92 runs sampled)
utf-8-validate (simdutf, C++) x 3,220 ops/sec ±0.85% (92 runs sampled)
utf-8-validate (5.0.5, JS) x 147 ops/sec ±0.07% (84 runs sampled)
isutf8 x 146 ops/sec ±0.08% (83 runs sampled)
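To make the figures above easier to compare, here is a minimal sketch that computes the ratio between the simdutf build and the 5.0.5 C++ build for a few of the inputs. The ops/sec values are copied from the output above; the `results` array and the `speedup` helper are illustrative names, not part of `bench.js`.

```javascript
// Ops/sec figures copied from the benchmark output above.
// `current` is utf-8-validate (5.0.5, C++), `simdutf` is utf-8-validate (simdutf, C++).
const results = [
  { input: 'en.wikipedia.org Main_Page', current: 110868, simdutf: 698016 },
  { input: 'ja.wikipedia.org main page', current: 23876, simdutf: 225834 },
  { input: '256B random ASCII', current: 8219664, simdutf: 4609830 },
  { input: '1KB random ASCII', current: 5043617, simdutf: 4458192 },
  { input: '64KB random ASCII', current: 162204, simdutf: 1193138 },
  { input: 'all valid UTF-8 (~4.17 MB)', current: 327, simdutf: 3220 },
];

// Hypothetical helper: a ratio above 1 means the simdutf build is faster.
function speedup(row) {
  return row.simdutf / row.current;
}

for (const row of results) {
  console.log(`${row.input}: ${speedup(row).toFixed(2)}x`);
}
```

A ratio below 1 marks an input where the simdutf build is slower than 5.0.5. In this run that happens only for the 256B and 1KB random ASCII inputs.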
The simdutf library (https://github.com/simdutf/simdutf) includes many features that we don't need. I wonder if it makes sense to fork it and strip out everything except UTF-8 validation.
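For context on what "UTF-8 validation only" covers, the check itself can be sketched as a plain scalar loop. This is an illustrative sketch only: it is not the code used by utf-8-validate or simdutf, and the function name `isValidUTF8Sketch` is made up for this example. It rejects overlong encodings, surrogates, code points above U+10FFFF, and truncated sequences.

```javascript
// Scalar UTF-8 validity check over a Buffer or Uint8Array (illustrative only).
function isValidUTF8Sketch(buf) {
  const len = buf.length;
  let i = 0;

  while (i < len) {
    const b0 = buf[i];

    // ASCII byte: always valid on its own.
    if (b0 < 0x80) {
      i += 1;
      continue;
    }

    let need; // number of continuation bytes expected
    let lo = 0x80; // allowed range for the first continuation byte
    let hi = 0xbf;

    if (b0 >= 0xc2 && b0 <= 0xdf) {
      need = 1; // 0xc0 and 0xc1 would be overlong, so they are excluded
    } else if (b0 >= 0xe0 && b0 <= 0xef) {
      need = 2;
      if (b0 === 0xe0) lo = 0xa0; // reject overlong 3-byte forms
      else if (b0 === 0xed) hi = 0x9f; // reject surrogates U+D800..U+DFFF
    } else if (b0 >= 0xf0 && b0 <= 0xf4) {
      need = 3;
      if (b0 === 0xf0) lo = 0x90; // reject overlong 4-byte forms
      else if (b0 === 0xf4) hi = 0x8f; // reject code points above U+10FFFF
    } else {
      return false; // stray continuation byte or invalid lead byte
    }

    // Truncated sequence at the end of the input.
    if (i + need >= len) return false;

    const b1 = buf[i + 1];
    if (b1 < lo || b1 > hi) return false;

    // Remaining continuation bytes must be in 0x80..0xbf.
    for (let k = 2; k <= need; k++) {
      const b = buf[i + k];
      if (b < 0x80 || b > 0xbf) return false;
    }

    i += need + 1;
  }

  return true;
}

console.log(isValidUTF8Sketch(Buffer.from('Pagina principală')));
console.log(isValidUTF8Sketch(Buffer.from([0xc0, 0x80])));
```

A loop like this handles one byte at a time, whereas a SIMD implementation processes many bytes per instruction.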