WHATWG-compliant?
Does this parser attempt to follow the spec that browsers follow?
Rust crates that follow the WHATWG spec (I think this list is complete):
- lol-html
- html5ever
- html5gum
In its benchmarking suite, tl compares itself against both kinds of parsers: ones that attempt to comply with the WHATWG spec and ones that don't. Since WHATWG defines error recovery etc. very precisely, that limits what kind of optimizations one can do, and explains why html5ever is slower in those benchmarks.
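To make the error-recovery point concrete, here is a minimal sketch of one such rule, in simplified form: in the WHATWG tree-construction stage, a `<p>` start tag first closes a `<p>` that is still open. The `Token` type and the `open_elements_at_text` function are hypothetical names made up for this illustration. They are not taken from tl, html5ever, or any other crate, and the real rule also involves scoping checks that are left out here.

```rust
/// Toy tokens for illustration; a real tokenizer is out of scope here.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Start(&'static str),
    End(&'static str),
    Text(&'static str),
}

/// Returns each text token together with the stack of open elements at that point.
/// With `implied_p_close` set, a `<p>` start tag first closes the nearest open `<p>`
/// (a simplified version of the spec rule; an assumption of this sketch).
fn open_elements_at_text(
    tokens: &[Token],
    implied_p_close: bool,
) -> Vec<(&'static str, Vec<&'static str>)> {
    let mut stack: Vec<&'static str> = Vec::new();
    let mut seen = Vec::new();
    for token in tokens {
        match token {
            Token::Start(name) => {
                if implied_p_close && *name == "p" {
                    // Pop everything up to and including the nearest open <p>, if any.
                    if let Some(pos) = stack.iter().rposition(|n| *n == "p") {
                        stack.truncate(pos);
                    }
                }
                stack.push(*name);
            }
            Token::End(name) => {
                // Close the nearest matching open element; ignore stray end tags.
                if let Some(pos) = stack.iter().rposition(|n| n == name) {
                    stack.truncate(pos);
                }
            }
            Token::Text(text) => seen.push((*text, stack.clone())),
        }
    }
    seen
}
```

For the token sequence of `<div><p>one<p>two</div>`, the naive mode (`implied_p_close = false`) reports "two" inside `div > p > p`, while the spec-like mode reports it inside `div > p`. Every rule of this kind adds bookkeeping on the hot path, which is the sort of cost a parser that only targets "sane" documents can skip.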
Thanks for bringing this up. It's a good point, and I think it's important to add this to the README. It seems a little unfair to compare crates like html5ever to this crate and make them look "bad" in the benchmarks when the reason is probably that they closely follow the spec, as you say. This is mentioned in the separate benchmark repo (linked in the README), but it's kind of hidden behind a wall of text (and not mentioned here), which is unfortunate. Currently, this crate doesn't follow the full spec. When I made this crate, I needed a fast library for a project: something that can parse "sane" HTML documents very quickly (it doesn't need to be spec-compliant, it just needs to be able to parse the typical document) and that provides a simple API to interact with the parsed tree. Hopefully in the future we can work towards being WHATWG-compliant without losing too much performance. Also, html5gum looks cool ;)
I added a few things to the README, hoping that it makes the goals of this crate clearer. I've also fixed up the benchmarks section a bit. The table now has a "follows spec" column (whether compliance with the spec is a goal) and a "note" saying that it's important to understand how much of a difference it makes for performance if a parser isn't bound to a specification.