diffhtml
Tracking new HTML parser work
I am rewriting the existing HTML parser, which is more than five years old. It is a stripped-down fork of node-fast-html-parser. Unfortunately, its regexes are unnecessarily complex and the code is hard to work on. The rewrite uses a modern tokenizer approach and aims to be as close to zero-copy as possible for large payloads. I'm iterating on the design with strict TDD, so I anticipate hundreds of new unit tests once this is complete.
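For readers unfamiliar with the idea, here is a minimal sketch of what an offset-based ("zero-copy") tokenizer can look like. It is not the actual diffhtml implementation: the `tokenize` function, the token types, and the `{ type, start, end }` shape are all hypothetical names chosen for illustration. The point is that tokens only record positions in the original string, so no substrings are created until a consumer asks for one.

```javascript
// Hypothetical sketch, not diffhtml's parser. Names and token shapes are
// assumptions made for illustration only.

// True for ASCII letters A-Z and a-z.
function isLetter(code) {
  return (code >= 65 && code <= 90) || (code >= 97 && code <= 122);
}

// Yields tokens as [start, end) offsets into `source` instead of copies.
function* tokenize(source) {
  const n = source.length;
  let i = 0;

  while (i < n) {
    if (source.startsWith('<!--', i)) {
      // Comment: runs to '-->' or to end of input if unterminated.
      const close = source.indexOf('-->', i + 4);
      const end = close === -1 ? n : close + 3;
      yield { type: 'comment', start: i, end };
      i = end;
      continue;
    }

    const next = source.charCodeAt(i + 1); // NaN past the end of input
    if (source[i] === '<' && (next === 47 || isLetter(next))) {
      // Tag: scan to '>' while skipping quoted attribute values, so
      // newlines and '>' inside quotes stay part of the tag.
      let j = i + 1;
      let quote = '';
      while (j < n) {
        const ch = source[j];
        if (quote) {
          if (ch === quote) quote = '';
        } else if (ch === '"' || ch === "'") {
          quote = ch;
        } else if (ch === '>') {
          break;
        }
        j++;
      }
      const end = Math.min(j + 1, n);
      yield { type: next === 47 ? 'close' : 'open', start: i, end };
      i = end;
      continue;
    }

    // Text: everything up to the next '<' (a lone '<' counts as text).
    let j = source.indexOf('<', i + 1);
    if (j === -1) j = n;
    yield { type: 'text', start: i, end: j };
    i = j;
  }
}

const source = '<!-- note --><button title="a\nb\nc">x</button>';
const tokens = [...tokenize(source)];

// A substring is only materialized when a consumer needs it.
const textOf = (token) => source.slice(token.start, token.end);
```

A real parser would go on to split tag tokens into names and attributes, again as offsets. Whether this saves memory in practice depends on the JavaScript engine, so treat it as a design illustration rather than a performance claim.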
Feature progress:
- [X] Significantly more reliable: fixes existing parser bugs and adds many unit tests
- [X] Support HTML comments
- [X] Smaller code footprint that is more specific towards VDOM
- [ ] Better middleware introspection for the parser, helping the linter plugin
Future of the parser:
~Post 1.0 launch, I want to invest time planning and building a parser compiled to WebAssembly that can then be plugged into any framework/runtime. This will not use regular expressions or anything hacky like the current parser. I think I'll need to solicit donations for that particular project or find some really passionate engineers who can help.~ Turns out this was easier than anticipated and will be added for the 1.0 slate.
@tbranyen – does the new parser support multiline attributes? I remember that being an issue with the current one, though I am not 100% sure.
```html
<button title="a
b
c">x</button>
```
Looks good with the latest parser:
```
tim in ~/git/diffhtml/packages/diffhtml on fix-createstate-between-render (home) cat test.js
import { innerHTML, html, Internals } from './index.js';
//import { parse } from '../diffhtml-rust-parser/dist/parser.js';
//Internals.parse = parse;
console.log(html`
<button title="a
b
c">x</button>
`);
tim in ~/git/diffhtml/packages/diffhtml on fix-createstate-between-render (home) node test.js
{
  rawNodeName: 'button',
  nodeName: 'button',
  nodeValue: '',
  nodeType: 1,
  key: '',
  childNodes: [
    {
      rawNodeName: '#text',
      nodeName: '#text',
      nodeValue: 'x',
      nodeType: 3,
      key: '',
      childNodes: [],
      attributes: {}
    }
  ],
  attributes: { title: 'a\nb\nc' }
}
```
Looks good with WASM Rust parser as well.