
HTML Parsing

Open mnutt opened this issue 3 years ago • 3 comments

I was curious if you had considered exposing something like DOMParser, or some other HTML parsing interface. I think in edge computing people often use it for rewriting outgoing HTML, though I was hoping to use it just to extract a few HTML attributes.

It looks like Rust has some decent HTML parsers (https://github.com/y21/tl), but looking at the other blueboat interfaces exposed to V8, most seem to be functional and don't hold any state. The interface I was imagining would tokenize the HTML in Rust, run the queries in Rust as well, and return just the result to JS. But perhaps there's a better way to set up the interface?

mnutt, Feb 13 '22 18:02

I considered an HTMLRewriter-like API backed by https://github.com/cloudflare/lol-html but a streaming rewriter doesn't feel as intuitive as the browser DOM; a proper browser-like DOM interface would be preferred.

Support for stateful native APIs was recently added to blueboat (https://github.com/losfair/blueboat/commit/123cc0c517e51613a04d9266cd62d0e80c38b223), so the DOM interface can be built on it. tl looks like a nice foundation for that!

losfair, Feb 14 '22 05:02

This is great! Agreed that a DOM interface is much nicer to use.

mnutt, Feb 14 '22 15:02

Basic DOM operations on HTML and XML documents are now implemented (https://github.com/losfair/blueboat/pull/71).

The API looks like:

let dom = TextUtil.DOM.HTMLDOMNode.parse('<div><p class="some-class">Test</p></div>', { fragment: true });
dom.queryWithFilter({type: "hasClass", className: "some-class"}, elem => {
  const props = elem.get();
  props.attrs.push({name: "data-test", value: "42"});
  elem.update(props);
  return true;
});
new TextDecoder().decode(dom.serialize());

(This is not yet the final API; it still needs some design work.)

losfair, Feb 15 '22 09:02
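
For the attribute-extraction use case raised at the start of the thread, the props object in the example could be wrapped in a couple of small helpers. This is only a minimal sketch: it assumes that elem.get() returns an object with an attrs array of {name, value} pairs, as the example suggests, and getAttr / setAttr are hypothetical names that are not part of blueboat. Since the API was described as not final, the shape may well change.

```javascript
// Hypothetical helpers (not part of blueboat). They assume only the props
// shape suggested by the example: { attrs: [{ name, value }, ...] }.

// Return the value of the named attribute, or undefined if it is absent.
function getAttr(props, name) {
  const attr = props.attrs.find(a => a.name === name);
  return attr ? attr.value : undefined;
}

// Set an attribute, overwriting an existing entry instead of pushing a duplicate.
function setAttr(props, name, value) {
  const attr = props.attrs.find(a => a.name === name);
  if (attr) {
    attr.value = value;
  } else {
    props.attrs.push({ name, value });
  }
  return props;
}

// Stand-in for the object that elem.get() is assumed to return.
const props = { attrs: [{ name: "class", value: "some-class" }] };
setAttr(props, "data-test", "42");
setAttr(props, "data-test", "43"); // overwrites the previous value
console.log(getAttr(props, "class"));     // some-class
console.log(getAttr(props, "data-test")); // 43
console.log(props.attrs.length);          // 2
```

Inside a queryWithFilter callback, the same helpers would presumably be called between elem.get() and elem.update(props); overwriting rather than pushing avoids ending up with two data-test entries if the callback runs more than once on the same element.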