Confused - why is a trie used here?
I don't see why a trie is used here. Tries are good when prefixes group items logically, or when you're genuinely receiving one character at a time, but I don't think either applies to HTML entities. Using a trie makes the bundle size larger:
| Compression | decode-data-html.ts | entities.json + legacy.json |
|---|---|---|
| None | 47,759b | 34,333b (72%) |
| `gzip -c` | 19,589b | 11,620b (59%) |
| `zstd --ultra -22` | 16,651b | 10,730b (64%) |
This isn't isolated. entities is substantially larger than html-entities (compare: entities vs html-entities), and takes up half of the htmlparser2 bundle (visualization) because of that.
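Concretely, the non-trie alternative can be as small as a lookup map plus one regex pass. A minimal sketch, assuming entities.json in the WHATWG format and skipping the semicolon-less legacy entities:

```ts
// Minimal sketch of the non-trie approach: one lookup map + one regex pass.
// Assumes entities.json in the WHATWG format, e.g.
//   { "&amp;": { "codepoints": [38], "characters": "&" }, ... }
// (tsconfig needs "resolveJsonModule": true)
import entities from "./entities.json";

const map = new Map<string, string>();
for (const [name, entry] of Object.entries(entities)) {
  map.set(name, (entry as { characters: string }).characters);
}

// Only handles "&name;" forms; the semicolon-less legacy entities
// (legacy.json) would need a second, more careful pass.
const ENTITY_RE = /&[a-zA-Z][a-zA-Z0-9]*;/g;

export function decodeNamed(html: string): string {
  return html.replace(ENTITY_RE, (match) => map.get(match) ?? match);
}
```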
An HTML parser is operating on a token stream, one character at a time, so embedding a trie is quite natural.
The trie gives the parser a considerable speed-up, and reduces the memory footprint.
Agreed that the bundle size overhead of the trie is not ideal; happy for any improvements to how it is stored/packaged that don't impact performance.
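To make the streaming point concrete, here is a minimal sketch of that shape. The node type is hypothetical, not the packed representation entities actually ships:

```ts
// Sketch of why a trie fits a streaming parser: the decoder consumes one
// character at a time and never buffers or re-scans input.
interface TrieNode {
  value?: string;                   // decoded text if a complete entity ends here
  children?: Map<string, TrieNode>; // next input character -> child
}

class StreamingEntityDecoder {
  private node: TrieNode;
  private matched: string | undefined;

  constructor(root: TrieNode) {
    this.node = root;
  }

  /** Feed one character; returns false once no entity can match any more. */
  push(char: string): boolean {
    const next = this.node.children?.get(char);
    if (next === undefined) return false;
    this.node = next;
    if (next.value !== undefined) this.matched = next.value; // longest match so far
    return true;
  }

  /** The longest complete entity seen so far, e.g. "&" after "a", "m", "p", ";". */
  result(): string | undefined {
    return this.matched;
  }
}
```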
Would it be too much of a slowdown to build the trie upon load?
The trie takes a few seconds to compute, so very likely, yes. We could skip some expensive optimizations, but not sure what the upside would be.
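For reference, a naive load-time build without those optimizations would look roughly like the sketch below. It should run in milliseconds for a few thousand entities, but it produces a pointer-heavy structure that is bigger and slower to walk than the packed one:

```ts
// What "build the trie upon load" could look like, minus the expensive
// optimizations: a plain pointer-based build from a name -> text map.
interface TrieNode {
  value?: string;
  children?: Map<string, TrieNode>;
}

function buildTrie(entries: Record<string, string>): TrieNode {
  const root: TrieNode = {};
  for (const [name, value] of Object.entries(entries)) {
    let node = root;
    for (const char of name) {
      node.children ??= new Map();
      let child = node.children.get(char);
      if (child === undefined) {
        child = {};
        node.children.set(char, child);
      }
      node = child;
    }
    node.value = value; // a complete entity ends at this node
  }
  return root;
}
```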
The closest alternative is probably to ship a WASM binary instead of the current JS file.
I'm not very experienced with HTML entities / parser design, but I made tinyentities, a proof of concept of better compression. I hope something there helps; here are a few of the most important parts:
- Pack the named entity mapping tightly. The first few lines look like

  ```
  9Tab	NewLine
  23excl!QUOT">0quot"
  ```

  which means "increment 9 on the first byte counter, there's `\t`, increment 1 (implicit), there's `\n`, increment 23, there's `!`, increment 1 (implicit), there's `QUOT` and `"` packed into one, then `quot` and `"` also packed in". It's very dense (8kB compressed) and effective (a rough unpacking sketch follows this list).
- Use one mapping. An 8kB mapping that you can make work both ways is better than, say, having to include both a 6kB encode-only mapping and an 8kB decode mapping.
- Keep things simple. Four `replaceAll` calls are both faster to load and faster to run than a giant complex abstraction. A few regexes + lookups aren't substantially slower than a trie for the purpose of encoding and decoding blocks of text, and both encoding and decoding lookup maps take under a millisecond to initialize. WASM seems irrelevant for these purposes.
- But maybe not always. While I personally find `tryReadHTML` more intuitive to use and objectively smaller to bring in than `EntityDecoder`, it also objectively takes 2.5x the time to do the same task.
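To sketch how a single mapping can stay cheap at load time: a hypothetical unpacker for the delta format quoted above. The parsing details are my reading of the description, not tinyentities' actual code:

```ts
// Hypothetical unpacker for a delta-packed table like the one quoted above:
// an optional decimal delta (default 1) advances a code point counter, and
// each name then maps to the counter's character. tinyentities' real
// encoding (e.g. how QUOT/quot share one code point) is more involved.
function unpack(packed: string): Map<string, string> {
  const map = new Map<string, string>();
  let code = 0;
  // token = optional digits (the delta) followed by an entity name
  for (const [, delta, name] of packed.matchAll(/(\d*)([A-Za-z]+)/g)) {
    code += delta ? Number(delta) : 1;
    map.set(name, String.fromCodePoint(code));
  }
  return map;
}

// With the map unpacked, decoding a block of text is a single regex pass:
function decodeNamed(html: string, map: Map<string, string>): string {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (m, name) => map.get(name) ?? m);
}
```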
I might experiment with other (possibly more directly useful) things later, but wanted to write down what I learned so far.
Very interesting — thanks for sharing!
My design goal with entities is very much "make the parser use-case as fast as possible, then support other cases" — the trie is quite handy here. But as you've shown, there is likely still quite some room for improvement.
@KTibow Thanks for the push — I went through and optimised the encoding of the en- and decode tries (in https://github.com/fb55/entities/pull/1944 and https://github.com/fb55/entities/pull/1948, by 32% and 44% respectively). Also found a better way of looking for characters to encode, which leads to a ~1.6x speed up for encodeHTML.
Published as 7.0.0.
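For illustration, the usual shape of that kind of encode-scan speed-up is to let a single regex skip runs of safe characters in bulk instead of testing every code unit in JS. A sketch of the general technique, not necessarily what changed in 7.0.0:

```ts
// A global regex finds the next character that needs escaping; the engine
// skips safe runs in bulk, and safe text is copied verbatim in slices.
const UNSAFE = /[&<>"']/g;
const REPLACEMENTS: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

function encodeHTML(input: string): string {
  UNSAFE.lastIndex = 0;
  let result = "";
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = UNSAFE.exec(input)) !== null) {
    // Copy the safe run verbatim, then the escaped character.
    result += input.slice(last, match.index) + REPLACEMENTS[match[0]];
    last = match.index + 1;
  }
  return result + input.slice(last);
}
```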
Nice work on the encode trie. Both smaller and faster!
But the decode trie...
| | v6.0.1 | v7.0.0 |
|---|---|---|
| JS (`wc -c`) | 47760b | 32394b |
| JS (gzipped, `wc -c`) | 19590b | 19991b |
| Unpacked (`.split("").length`) | 15242b | 24150b |
seems to have actually gotten larger. It's only smaller when raw, and only because of switching to base 64.
Not sure if this is an accident or a speed-improving decision.