Confused - why is a trie used here?
I don't see why a trie is used here. Tries are good when prefixes group items logically, or when you're genuinely receiving one character at a time, but I don't think either applies to HTML entities. Using a trie makes the bundle size larger:
| Compression | decode-data-html.ts | entities.json + legacy.json |
|---|---|---|
| None | 47,759b | 34,333b (72%) |
| `gzip -c` | 19,589b | 11,620b (59%) |
| `zstd --ultra -22` | 16,651b | 10,730b (64%) |
This isn't isolated. entities is substantially larger than html-entities (compare: entities vs html-entities), and takes up half of the htmlparser2 bundle (visualization) because of that.
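Concretely, the non-trie alternative can be as small as a lookup map plus one regex pass. A minimal sketch, assuming entities.json in the WHATWG format and skipping the semicolon-less legacy entities:

```ts
// Minimal sketch of the non-trie approach: one lookup map + one regex pass.
// Assumes entities.json in the WHATWG format, e.g.
//   { "&amp;": { "codepoints": [38], "characters": "&" }, ... }
// (tsconfig needs "resolveJsonModule": true)
import entities from "./entities.json";

const map = new Map<string, string>();
for (const [name, entry] of Object.entries(entities)) {
  map.set(name, (entry as { characters: string }).characters);
}

// Only handles "&name;" forms; the semicolon-less legacy entities
// (legacy.json) would need a second, more careful pass.
const ENTITY_RE = /&[a-zA-Z][a-zA-Z0-9]*;/g;

export function decodeNamed(html: string): string {
  return html.replace(ENTITY_RE, (match) => map.get(match) ?? match);
}
```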
An HTML parser is operating on a token stream, one character at a time, so embedding a trie is quite natural.
The trie gives the parser a considerable speed-up, and reduces the memory footprint.
Agreed that the bundle size overhead of the trie is not ideal; happy for any improvements to how it is stored/packaged that don't impact performance.
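To make the streaming point concrete, here is a minimal sketch of that shape. The node type is hypothetical, not the packed representation entities actually ships:

```ts
// Sketch of why a trie fits a streaming parser: the decoder consumes one
// character at a time and never buffers or re-scans input.
interface TrieNode {
  value?: string;                   // decoded text if a complete entity ends here
  children?: Map<string, TrieNode>; // next input character -> child
}

class StreamingEntityDecoder {
  private node: TrieNode;
  private matched: string | undefined;

  constructor(root: TrieNode) {
    this.node = root;
  }

  /** Feed one character; returns false once no entity can match any more. */
  push(char: string): boolean {
    const next = this.node.children?.get(char);
    if (next === undefined) return false;
    this.node = next;
    if (next.value !== undefined) this.matched = next.value; // longest match so far
    return true;
  }

  /** The longest complete entity seen so far, e.g. "&" after "a", "m", "p", ";". */
  result(): string | undefined {
    return this.matched;
  }
}
```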
Would it be too much of a slowdown to build the trie upon load?
The trie takes a few seconds to compute, so very likely, yes. We could skip some expensive optimizations, but not sure what the upside would be.
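For reference, a naive load-time build without those optimizations would look roughly like the sketch below. It should run in milliseconds for a few thousand entities, but it produces a pointer-heavy structure that is bigger and slower to walk than the packed one:

```ts
// What "build the trie upon load" could look like, minus the expensive
// optimizations: a plain pointer-based build from a name -> text map.
interface TrieNode {
  value?: string;
  children?: Map<string, TrieNode>;
}

function buildTrie(entries: Record<string, string>): TrieNode {
  const root: TrieNode = {};
  for (const [name, value] of Object.entries(entries)) {
    let node = root;
    for (const char of name) {
      node.children ??= new Map();
      let child = node.children.get(char);
      if (child === undefined) {
        child = {};
        node.children.set(char, child);
      }
      node = child;
    }
    node.value = value; // a complete entity ends at this node
  }
  return root;
}
```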
The closest alternative is probably to ship a WASM binary instead of the current JS file.
I'm not very experienced with HTML entities / parser design, but I made tinyentities, a proof of concept of better compression. I hope something there helps; here are a few of the most important parts:
- Pack the named entity mapping tightly. The first few lines look like

  ```
  9Tab	NewLine
  23excl!QUOT">0quot"
  ```

  which means "increment 9 on the first byte counter, there's `\t`, increment 1 (implicit), there's `\n`, increment 23, there's `!`, increment 1 (implicit), there's `QUOT` and `"` packed into one, then `quot` and `"` also packed in". It's very dense (8kB compressed) and effective (a rough unpacking sketch follows this list).
- Use one mapping. An 8kB mapping that you can make work both ways is better than, say, having to include both a 6kB encode-only mapping and an 8kB decode mapping.
- Keep things simple. Four `replaceAll` calls are both faster to load and faster to run than a giant complex abstraction. A few regexes + lookups aren't substantially slower than a trie for the purpose of encoding and decoding blocks of text, and both encoding and decoding lookup maps take under a millisecond to initialize. WASM seems irrelevant for these purposes.
- But maybe not always. While I personally find `tryReadHTML` more intuitive to use and objectively smaller to bring in than `EntityDecoder`, it also objectively takes 2.5x the time to do the same task.
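To sketch how a single mapping can stay cheap at load time: a hypothetical unpacker for the delta format quoted above. The parsing details are my reading of the description, not tinyentities' actual code:

```ts
// Hypothetical unpacker for a delta-packed table like the one quoted above:
// an optional decimal delta (default 1) advances a code point counter, and
// each name then maps to the counter's character. tinyentities' real
// encoding (e.g. how QUOT/quot share one code point) is more involved.
function unpack(packed: string): Map<string, string> {
  const map = new Map<string, string>();
  let code = 0;
  // token = optional digits (the delta) followed by an entity name
  for (const [, delta, name] of packed.matchAll(/(\d*)([A-Za-z]+)/g)) {
    code += delta ? Number(delta) : 1;
    map.set(name, String.fromCodePoint(code));
  }
  return map;
}

// With the map unpacked, decoding a block of text is a single regex pass:
function decodeNamed(html: string, map: Map<string, string>): string {
  return html.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (m, name) => map.get(name) ?? m);
}
```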
I might experiment with other (possibly more directly useful) things later, but wanted to write down what I learned so far.
Very interesting — thanks for sharing!
My design goal with entities is very much "make the parser use-case as fast as possible, then support other cases" — the trie is quite handy here. But as you've shown, there is likely still quite some room for improvement.
@KTibow Thanks for the push — I went through and optimised the encoding of the en- and decode tries (in https://github.com/fb55/entities/pull/1944 and https://github.com/fb55/entities/pull/1948, by 32% and 44% respectively). Also found a better way of looking for characters to encode, which leads to a ~1.6x speed up for encodeHTML.
Published as 7.0.0.
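For illustration, the usual shape of that kind of encode-scan speed-up is to let a single regex skip runs of safe characters in bulk instead of testing every code unit in JS. A sketch of the general technique, not necessarily what changed in 7.0.0:

```ts
// A global regex finds the next character that needs escaping; the engine
// skips safe runs in bulk, and safe text is copied verbatim in slices.
const UNSAFE = /[&<>"']/g;
const REPLACEMENTS: Record<string, string> = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

function encodeHTML(input: string): string {
  UNSAFE.lastIndex = 0;
  let result = "";
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = UNSAFE.exec(input)) !== null) {
    // Copy the safe run verbatim, then the escaped character.
    result += input.slice(last, match.index) + REPLACEMENTS[match[0]];
    last = match.index + 1;
  }
  return result + input.slice(last);
}
```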
Nice work on the encode trie. Both smaller and faster!
But the decode trie...
| | v6.0.1 | v7.0.0 |
|---|---|---|
| JS (`wc -c`) | 47760b | 32394b |
| JS (gzipped, `wc -c`) | 19590b | 19991b |
| Unpacked (`.split("").length`) | 15242b | 24150b |
seems to have actually gotten larger. It's only smaller when raw, and only because of switching to base 64.
Not sure if this is an accident or a speed-improving decision.