Tracking new HTML parser work

Open tbranyen opened this issue 2 years ago • 3 comments

I am rewriting the existing HTML parser, which is more than five years old. It is a stripped-down fork of node-fast-html-parser; unfortunately, its regexes are unnecessarily complex and the code is hard to work on. The rewrite uses a modern tokenizer approach and aims to be as close to zero-copy as possible for large payloads. I'm iterating on the design with strict TDD, so I anticipate hundreds of new unit tests by the time this is complete.
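
As a rough illustration of the tokenizer idea (a hypothetical sketch, not the actual diffhtml code; `tokenize`, `textOf`, and the token shape are names invented here), a scanner can record start/end offsets into the original string instead of copying substrings, and only slice when a value is actually needed:

```javascript
// Hypothetical sketch: token kinds and function names are illustrative only,
// not diffhtml internals.
function tokenize(source) {
  const tokens = [];
  let pos = 0;

  while (pos < source.length) {
    if (source.startsWith('<!--', pos)) {
      // Comment: runs to the closing marker, or to end of input if unclosed.
      const close = source.indexOf('-->', pos + 4);
      const end = close === -1 ? source.length : close + 3;
      tokens.push({ type: 'comment', start: pos, end });
      pos = end;
    } else if (source[pos] === '<') {
      // Tag: runs to the next '>'. A '>' inside a quoted attribute value is
      // not handled in this simplified sketch.
      const close = source.indexOf('>', pos + 1);
      const end = close === -1 ? source.length : close + 1;
      tokens.push({ type: 'tag', start: pos, end });
      pos = end;
    } else {
      // Text: runs to the next '<', or to end of input.
      const next = source.indexOf('<', pos);
      const end = next === -1 ? source.length : next;
      tokens.push({ type: 'text', start: pos, end });
      pos = end;
    }
  }

  return tokens;
}

// Strings are materialized on demand from the recorded offsets.
const textOf = (source, token) => source.slice(token.start, token.end);
```

Each token holds only two numbers and a type, so a large payload is walked once without intermediate string copies; `textOf(source, token)` produces the text for a token when a consumer asks for it.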

Feature progress:

  • [X] Significantly more reliable: fixes bugs that exist in the current parser and adds many unit tests
  • [X] Support HTML comments
  • [X] Smaller code footprint that is more specific towards VDOM
  • [ ] Better middleware introspection for the parser, helping the linter plugin

Future of the parser:

~Post 1.0 launch, I want to invest time planning and building a parser compiled to WebAssembly that can then be plugged into any framework/runtime. This will not use regular expressions or anything hacky like the current parser. I think I'll need to solicit donations for that particular project or find some really passionate engineers who can help.~ It turns out this was easier than anticipated, so it will be included in the 1.0 slate.

tbranyen, Apr 14 '22 19:04

@tbranyen – does the new parser support multiline attributes? I remember that being an issue with the current one, though I am not 100% sure.

<button title="a
b
c">x</button>

CetinSert, Jan 23 '23 21:01

Looks good with the latest parser:

tim in ~/git/diffhtml/packages/diffhtml on fix-createstate-between-render (home) cat test.js
import { innerHTML, html, Internals } from './index.js';
//import { parse } from '../diffhtml-rust-parser/dist/parser.js';

//Internals.parse = parse;

console.log(html`
<button title="a
b
c">x</button>
`);
tim in ~/git/diffhtml/packages/diffhtml on fix-createstate-between-render (home) node test.js
{
  rawNodeName: 'button',
  nodeName: 'button',
  nodeValue: '',
  nodeType: 1,
  key: '',
  childNodes: [
    {
      rawNodeName: '#text',
      nodeName: '#text',
      nodeValue: 'x',
      nodeType: 3,
      key: '',
      childNodes: [],
      attributes: {}
    }
  ],
  attributes: { title: 'a\nb\nc' }
}
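
For readers wondering why a character-walking scanner has no trouble here, the following is a hypothetical sketch (not the diffhtml parser; `readAttributes` is a name invented for illustration, and only double-quoted and bare values are handled). Once inside a quoted value, it looks only for the closing quote, so a newline is treated like any other character:

```javascript
// Hypothetical helper: reads attributes from a single opening tag string.
function readAttributes(tag) {
  const attributes = {};
  const isSpace = (ch) => ch === ' ' || ch === '\n' || ch === '\t' || ch === '\r';

  // Skip '<' and the tag name.
  let pos = 1;
  while (pos < tag.length && !isSpace(tag[pos]) && tag[pos] !== '>') pos++;

  while (pos < tag.length) {
    // Skip whitespace between attributes (newlines included).
    while (pos < tag.length && isSpace(tag[pos])) pos++;
    if (pos >= tag.length || tag[pos] === '>' || tag[pos] === '/') break;

    // Read the attribute name up to '=', whitespace, or the end of the tag.
    const nameStart = pos;
    while (pos < tag.length && tag[pos] !== '=' && !isSpace(tag[pos]) && tag[pos] !== '>') pos++;
    const name = tag.slice(nameStart, pos);

    let value = '';
    if (tag[pos] === '=' && tag[pos + 1] === '"') {
      // Quoted value: everything up to the closing quote, newlines included.
      const valueStart = pos + 2;
      const close = tag.indexOf('"', valueStart);
      const valueEnd = close === -1 ? tag.length : close;
      value = tag.slice(valueStart, valueEnd);
      pos = valueEnd + 1;
    } else if (tag[pos] === '=') {
      // Bare value: runs to the next whitespace or '>'.
      const valueStart = pos + 1;
      pos = valueStart;
      while (pos < tag.length && !isSpace(tag[pos]) && tag[pos] !== '>') pos++;
      value = tag.slice(valueStart, pos);
    }

    attributes[name] = value;
  }

  return attributes;
}
```

Calling `readAttributes('<button title="a\nb\nc">')` yields an object whose `title` contains the embedded newlines, matching the shape of the `attributes` field in the output above.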

tbranyen, Jan 23 '23 23:01

Looks good with the WASM Rust parser as well.

tbranyen, Jan 23 '23 23:01