entities icon indicating copy to clipboard operation
entities copied to clipboard

Confused - why is a trie used here?

Open KTibow opened this issue 5 months ago • 7 comments

I don't see why a trie is used here. Tries are good when prefixes group items logically or when you're genuinely receiving one character at a time, but I don't think this happens in HTML entities. Using a trie is making the bundle size larger:

Compression decode-data-html.ts entities.json + legacy.json
None 47,759b 34,333b (72%)
gzip -c 19,589b 11,620b (59%)
zstd --ultra -22 16,651b 10,730b (64%)

This isn't isolated. entities is substantially larger than html-entities (compare: entities vs html-entities), and takes up half of the htmlparser2 bundle (visualization) because of that.

KTibow avatar Aug 15 '25 18:08 KTibow

An HTML parser is operating on a token stream, one character at a time, so embedding a trie is quite natural.

The trie is a considerable speed-up of the parser, and reduces the memory footprint.

Agreed that the bundle size overhead of the trie is not ideal, happy for any improvements to how it is stored/packaged that doesn't impact performance.

fb55 avatar Aug 16 '25 07:08 fb55

Would it be too much of a slowdown to build the trie upon load?

KTibow avatar Aug 16 '25 11:08 KTibow

The trie takes a few seconds to compute, so very likely, yes. We could skip some expensive optimizations, but not sure what the upside would be.

Closest alternative is probably to ship a WASM binary instead of the current JS file

fb55 avatar Aug 16 '25 16:08 fb55

I'm not very experienced with HTML entities / parser design but I made a proof of concept, tinyentities, of better compression. I hope something there helps - here's a few of the most important parts:

  • Pack the named entity mapping tightly. The first few lines look like
    9Tab
    NewLine
    23excl
    QUOT!>0quot!
    
    which means "increment 9 on the first byte counter, there's 	, increment 1 (implicit), there's 
, increment 23, there's !, increment 1 (implicit), there's " and &QUOT packed into one, then " and &quot also packed in. It's very dense (8kb compressed) and effective.
  • Use one mapping. An 8kb mapping that you can make work both ways is better than, say, having to include both a 6kB encode only mapping and an 8kB decode mapping.
  • Keep things simple. 4 replaceAll calls is both faster to load and faster to run compared to a giant complex abstraction. A few regex + lookups aren't substantially slower than a trie for the purpose of encoding and decoding blocks of text, and both encoding and decoding lookup maps take under a millisecond to initialize. WASM seems irrelevant for these purposes.
  • But maybe not always. While I personally find tryReadHTML more intuitive to use and objectively smaller to bring in compared to EntityDecoder, it also objectively takes 2.5x the time to do the same task.

I might experiment with other (possibly more directly useful) things later, but wanted to write down what I learned so far.

KTibow avatar Aug 16 '25 18:08 KTibow

Very interesting — thanks for sharing!

My design goal with entities is very much "make the parser use-case as fast as possible, then support other cases" — the trie is quite handy here. But as you've shown, there is likely still quite some room for improvements.

fb55 avatar Aug 18 '25 21:08 fb55

@KTibow Thanks for the push — I went through and optimised the encoding of the en- and decode tries (in https://github.com/fb55/entities/pull/1944 and https://github.com/fb55/entities/pull/1948, by 32% and 44% respectively). Also found a better way of looking for characters to encode, which leads to a ~1.6x speed up for encodeHTML.

Published as 7.0.0.

fb55 avatar Sep 06 '25 22:09 fb55

Nice work on the encode trie. Both smaller and faster!

But the decode trie...

v6.0.1 v7.0.0
JS (wc -c) 47760b 32394b
JS (gzipped, wc -c) 19590b 19991b
Unpacked (.split("").length) 15242b 24150b

seems to have actually gotten larger. It's only smaller when raw, and only because of switching to base 64.

Not sure if this is an accident or a speed-improving decision.

KTibow avatar Sep 07 '25 00:09 KTibow