named_characters: Use a DAFSA to reduce the compiled size of the data by 161KiB
Notes:
- There should be no difference in functionality after this PR
- The new `named_characters.zig` comes from https://github.com/squeek502/named-character-references; see its README for some more details
Before this commit, named_characters.zig contained three arrays to hold the named character reference data, totaling 19,708 + 9,853 + 157,664 = 187,225 bytes (182.84 KiB). After this commit, named_characters.zig contains one array and one packed array, totaling 15,488 + 5,857 = 21,345 bytes (20.84 KiB). That is a difference of ~162 KiB; in practice, the difference in executable size is around 161 KiB.
A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, with redundant nodes merged. This provides fast lookups while minimizing the required data size, but normally does not allow associating data with each word. However, by storing in each node a count of the number of words reachable from it, the DAFSA can also be used to achieve minimal perfect hashing for the set of words (which allows going from word -> unique index as well as unique index -> word). This lets us store a second array of data, indexed by that unique index, so the DAFSA can be used as a lookup for e.g. the codepoint(s) that each named character reference maps to.
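The word -> unique index part can be sketched like this (Python for brevity; all names are illustrative, and a plain trie is used instead of a true DAFSA since merging identical subtrees doesn't change how the counting works):

```python
class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.is_word = False
        self.count = 0       # number of words reachable from this node

def build(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_word = True

    def fill_counts(node):
        node.count = int(node.is_word) + sum(
            fill_counts(child) for child in node.children.values())
        return node.count

    fill_counts(root)
    return root

def unique_index(root, word):
    """Minimal-perfect-hash index of `word` (its position in sorted
    order), or None if `word` is not in the set."""
    index = 0
    node = root
    for ch in word:
        if ch not in node.children:
            return None
        # All words under alphabetically-earlier siblings sort first,
        # so skipping a sibling skips all of the words beneath it.
        for c in sorted(node.children):
            if c == ch:
                break
            index += node.children[c].count
        node = node.children[ch]
        if node.is_word:
            index += 1
    return index - 1 if node.is_word else None
```

With `root = build(["a", "amp", "and", "angle"])`, `unique_index` maps each word to its sorted position (0 through 3), so a parallel array of associated data (e.g. codepoints) can be indexed directly.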
Some resources:
- https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
- https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
- https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5272 (this is the origin of the minimal perfect hashing technique)
- http://stevehanov.ca/blog/?id=115
In terms of performance, there is no difference in the average case (HTML with some but not many named character references):
```
'bench.exe' ran 1.00 ± 0.01 times faster than 'bench-dafsa.exe'
```
Benchmark code: https://gist.github.com/squeek502/f49f286ad346c5073b49f3173b810de9 Using the files from: https://github.com/AndreasMadsen/htmlparser-benchmark/tree/master/files
However, when microbenchmarking with HTML that contains tens of thousands of named character references, there is a slight regression:
```
Benchmark 1: microbench-release.exe
  Time (mean ± σ):      32.5 ms ±  1.8 ms    [User: 16.2 ms, System: 19.0 ms]
  Range (min … max):    27.0 ms … 38.4 ms    56 runs

Benchmark 2: microbench-release-dafsa.exe
  Time (mean ± σ):      37.1 ms ±  1.9 ms    [User: 20.6 ms, System: 16.6 ms]
  Range (min … max):    31.6 ms … 41.8 ms    50 runs

Summary
  'microbench-release.exe' ran
    1.14 ± 0.09 times faster than 'microbench-release-dafsa.exe'
```
Benchmark code and test file: https://gist.github.com/squeek502/07b7dee1086f6e9dc38c4a880addfeca
This is because there is an added cost to getting the associated codepoints once a named character reference has been matched, since with a DAFSA it requires re-walking the trie in order to calculate the unique index (minimal perfect hash) of the associated data.
Note also that I deleted the character_reference_data.json and generate_named_characters.zig tool and didn't replace it with anything, since the new named_characters.zig is generated externally in https://github.com/squeek502/named-character-references. Let me know if you'd like me to do something different with regards to that.
Another alternative would be to use https://github.com/squeek502/named-character-references via the Zig package manager. Let me know if that would be preferable to vendoring the generated file.
> This is because there is an added cost to getting the associated codepoints once a named character reference has been matched, since with a DAFSA it requires re-walking the trie in order to calculate the unique index (minimal perfect hash) of the associated data.
Added a commit that makes this no longer true:
Matcher now calculates the unique index as it walks the DAFSA and stores the unique index of the last match, which means:
- No longer need to store the matched string in order to look up the associated codepoints
- No longer need to rewalk the DAFSA in order to calculate the unique index
This simplifies the implementation in Tokenizer.namedCharacterReference and gets the performance of the DAFSA nearer to the pre-DAFSA performance when benchmarking HTML with tens of thousands of named character references.
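The no-rewalk approach can be sketched like so (again Python and illustrative names rather than the actual Zig implementation, and a plain trie standing in for the DAFSA): the matcher accumulates the unique index while consuming characters, and records it at every end-of-word node so the longest match's index is already known the moment matching stops:

```python
class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.is_word = False
        self.count = 0       # number of words reachable from this node

def build(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_word = True

    def fill_counts(node):
        node.count = int(node.is_word) + sum(
            fill_counts(child) for child in node.children.values())
        return node.count

    fill_counts(root)
    return root

class Matcher:
    """Consumes one character at a time, accumulating the minimal-
    perfect-hash index as it walks. `last_match_index` always holds the
    index of the longest word matched so far, so neither the matched
    string nor a second walk is needed to look up the associated data."""
    def __init__(self, root):
        self.node = root
        self.index = 0
        self.last_match_index = None

    def feed(self, ch):
        child = self.node.children.get(ch)
        if child is None:
            return False  # dead end; longest match (if any) is recorded
        # Skip the word counts of alphabetically-earlier siblings.
        for c in sorted(self.node.children):
            if c == ch:
                break
            self.index += self.node.children[c].count
        self.node = child
        if child.is_word:
            self.index += 1
            self.last_match_index = self.index - 1
        return True
```

Tracking the *last* match matters because named character references use longest-match semantics: e.g. with the set `{"not", "notin"}`, feeding `n`, `o`, `t`, `i` must still remember the index of `not` in case no longer reference ends up matching.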
```
Benchmark 1: microbench.exe
  Time (mean ± σ):      28.1 ms ±  1.1 ms    [User: 15.4 ms, System: 12.6 ms]
  Range (min … max):    26.6 ms … 32.3 ms    62 runs

Benchmark 2: microbench-dafsa-no-rewalk.exe
  Time (mean ± σ):      29.9 ms ±  1.1 ms    [User: 18.1 ms, System: 13.8 ms]
  Range (min … max):    28.5 ms … 32.9 ms    58 runs

Summary
  'microbench.exe' ran
    1.06 ± 0.06 times faster than 'microbench-dafsa-no-rewalk.exe'
```
I'm back with (hopefully) a more compelling PR, after spending way too much time on this particular corner of HTML tokenization.
The relevant changes:
- This PR now represents a performance increase as well (still uses a DAFSA, just with some modifications that you can read about here)
- Uses https://github.com/squeek502/named-character-references via the Zig package manager instead of vendoring the code (but whatever you prefer is fine with me, so let me know)
I also wrote a rather in-depth article on the topic if you're interested:
Slightly better named character reference tokenization than Chrome, Safari, and Firefox
And finally, here's what the benchmark from the OP looks like now compared to the master branch:
```
Benchmark 1 (275 runs): ./microbench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time            36.3ms ± 1.16ms    35.9ms … 54.5ms         11 ( 4%)        0%
  peak_rss             49.4MB ± 20.7KB    49.3MB … 49.4MB          7 ( 3%)        0%
  cpu_cycles           46.9M  ±  183K     45.8M  … 47.4M           2 ( 1%)        0%
  instructions         78.1M  ± 2.86      78.1M  … 78.1M           3 ( 1%)        0%
  cache_references     2.53M  ± 22.5K     2.46M  … 2.60M           4 ( 1%)        0%
  cache_misses         87.8K  ± 2.41K     82.2K  … 94.9K           3 ( 1%)        0%
  branch_misses         318K  ±   723      317K  …  324K           8 ( 3%)        0%
Benchmark 2 (288 runs): ./microbench-dafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time            34.7ms ± 1.48ms    33.6ms … 54.8ms          1 ( 0%)        ⚡-  4.3% ±  0.6%
  peak_rss             49.3MB ± 25.2KB    49.2MB … 49.3MB         11 ( 4%)          -  0.3% ±  0.0%
  cpu_cycles           37.6M  ±  189K     36.2M  … 38.4M           4 ( 1%)        ⚡- 19.9% ±  0.1%
  instructions         60.3M  ± 2.91      60.3M  … 60.3M           1 ( 0%)        ⚡- 22.9% ±  0.0%
  cache_references     2.52M  ± 35.3K     2.44M  … 2.83M           4 ( 1%)          -  0.3% ±  0.2%
  cache_misses         58.7K  ± 2.33K     53.0K  … 65.4K           1 ( 0%)        ⚡- 33.1% ±  0.4%
  branch_misses         247K  ±   651      247K  …  251K           9 ( 3%)        ⚡- 22.2% ±  0.0%
```
EDIT: Updated the OP to be current as well.
That article was extremely deep. Probably deeper than anyone's ever gone into this topic. Your new implementation is definitely better thought out than the one I had. Thanks!