rem named_characters: Use a DAFSA to reduce the compiled size of the data by 161KiB

Notes:

There should be no difference in functionality after this PR
The new named_characters.zig comes from https://github.com/squeek502/named-character-references, see its README for some more details

Before this commit, named_characters.zig contained three arrays to hold the named character reference data totaling 19,708 + 9,853 + 157,664 = 187,225 bytes or 182.84 KiB. After this commit, named_characters.zig contains 1 array and 1 packed array totaling 15,488 + 5,857 = 21,345 bytes or 20.84 KiB. This is a difference of 162 KiB, but in practice, the difference in executable size is around ~161 KiB.

A DAFSA (deterministic acyclic finite state automaton) is essentially a trie flattened into an array, but it also uses techniques to minimize redundant nodes. This provides fast lookups while minimizing the required data size, but normally does not allow for associating data related to each word. However, by adding a count of the number of possible words from each node, it becomes possible to also use it to achieve minimal perfect hashing for the set of words (which allows going from word -> unique index as well as unique index -> word). This allows us to store a second array of data so that the DAFSA can be used as a lookup for e.g. Codepoints.

Some resources:

https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_state_automaton
https://web.archive.org/web/20220722224703/http://pages.pathcom.com/~vadco/dawg.html
https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.5272 (this is the origin of the minimal perfect hashing technique)
http://stevehanov.ca/blog/?id=115

In terms of performance, there is no difference in the average case (HTML with some but not many named character references):

'bench.exe' ran 1.00 ± 0.01 times faster than 'bench-dafsa.exe'

Benchmark code: https://gist.github.com/squeek502/f49f286ad346c5073b49f3173b810de9 Using the files from: https://github.com/AndreasMadsen/htmlparser-benchmark/tree/master/files

However, when microbenchmarking with HTML that contains tens of thousands of named character references, there is a slight regression:

Benchmark 1: microbench-release.exe
  Time (mean ± σ):      32.5 ms ±   1.8 ms    [User: 16.2 ms, System: 19.0 ms]
  Range (min … max):    27.0 ms …  38.4 ms    56 runs

Benchmark 2: microbench-release-dafsa.exe
  Time (mean ± σ):      37.1 ms ±   1.9 ms    [User: 20.6 ms, System: 16.6 ms]
  Range (min … max):    31.6 ms …  41.8 ms    50 runs

Summary
  'microbench-release.exe' ran
    1.14 ± 0.09 times faster than 'microbench-release-dafsa.exe'

Benchmark code and test file: https://gist.github.com/squeek502/07b7dee1086f6e9dc38c4a880addfeca

This is because there is an added cost to getting the associated codepoints once a named character character reference has been matched, since with a DAFSA it requires re-walking the trie in order to calculate the unique index (minimal perfect hash) of the associated data.

Note also that I deleted the character_reference_data.json and generate_named_characters.zig tool and didn't replace it with anything, since the new named_characters.zig is generated externally in https://github.com/squeek502/named-character-references. Let me know if you'd like me to do something different with regards to that.

Another alternative would be to use https://github.com/squeek502/named-character-references via the Zig package manager. Let me know if that would be preferable to vendoring the generated file.

Jun 19 '24 06:06 squeek502

This is because there is an added cost to getting the associated codepoints once a named character character reference has been matched, since with a DAFSA it requires re-walking the trie in order to calculate the unique index (minimal perfect hash) of the associated data.

Added a commit that makes this no longer true:

Matcher now calculates the unique index as it walks the DAFSA and stores the unique index of the last match, which means:

No longer need to store the matched string in order to look up the associated codepoints
No longer need to rewalk the DAFSA in order to calculate the unique index

This simplifies the implementation in Tokenizer.namedCharacterReference and gets the performance of the DAFSA nearer to the pre-DAFSA performance when benchmarking HTML with tens of thousands of named character references.

Benchmark 1: microbench.exe
  Time (mean ± σ):      28.1 ms ±   1.1 ms    [User: 15.4 ms, System: 12.6 ms]
  Range (min … max):    26.6 ms …  32.3 ms    62 runs

Benchmark 2: microbench-dafsa-no-rewalk.exe
  Time (mean ± σ):      29.9 ms ±   1.1 ms    [User: 18.1 ms, System: 13.8 ms]
  Range (min … max):    28.5 ms …  32.9 ms    58 runs

Summary
  'microbench.exe' ran
    1.06 ± 0.06 times faster than 'microbench-dafsa-no-rewalk.exe'

Jun 19 '24 11:06 squeek502

I'm back with (hopefully) a more compelling PR, after spending way too much time on this particular corner of HTML tokenization.

The relevant changes:

This PR now represents a performance increase as well (still uses a DAFSA, just with some modifications that you can read about here)
Uses https://github.com/squeek502/named-character-references via the Zig package manager instead of vendoring the code (but whatever you prefer is fine with me, so let me know)

I also wrote a rather in-depth article on the topic if you're interested:

Slightly better named character reference tokenization than Chrome, Safari, and Firefox

And finally, here's what the benchmark from the OP looks like now compared to the master branch:

Benchmark 1 (275 runs): ./microbench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          36.3ms ± 1.16ms    35.9ms … 54.5ms         11 ( 4%)        0%
  peak_rss           49.4MB ± 20.7KB    49.3MB … 49.4MB          7 ( 3%)        0%
  cpu_cycles         46.9M  ±  183K     45.8M  … 47.4M           2 ( 1%)        0%
  instructions       78.1M  ± 2.86      78.1M  … 78.1M           3 ( 1%)        0%
  cache_references   2.53M  ± 22.5K     2.46M  … 2.60M           4 ( 1%)        0%
  cache_misses       87.8K  ± 2.41K     82.2K  … 94.9K           3 ( 1%)        0%
  branch_misses       318K  ±  723       317K  …  324K           8 ( 3%)        0%
Benchmark 2 (288 runs): ./microbench-dafsa
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          34.7ms ± 1.48ms    33.6ms … 54.8ms          1 ( 0%)        ⚡-  4.3% ±  0.6%
  peak_rss           49.3MB ± 25.2KB    49.2MB … 49.3MB         11 ( 4%)          -  0.3% ±  0.0%
  cpu_cycles         37.6M  ±  189K     36.2M  … 38.4M           4 ( 1%)        ⚡- 19.9% ±  0.1%
  instructions       60.3M  ± 2.91      60.3M  … 60.3M           1 ( 0%)        ⚡- 22.9% ±  0.0%
  cache_references   2.52M  ± 35.3K     2.44M  … 2.83M           4 ( 1%)          -  0.3% ±  0.2%
  cache_misses       58.7K  ± 2.33K     53.0K  … 65.4K           1 ( 0%)        ⚡- 33.1% ±  0.4%
  branch_misses       247K  ±  651       247K  …  251K           9 ( 3%)        ⚡- 22.2% ±  0.0%

EDIT: Updated the OP to be current as well.

Jul 18 '25 09:07 squeek502

That article was extremely deep. Probably deeper than anyone's ever gone into this topic. Your new implementation is definitely more well-thought out than the one I had. Thanks!

Jul 23 '25 06:07 chadwain