html5ever icon indicating copy to clipboard operation
html5ever copied to clipboard

parse_html adds unwanted tags like <html><head>...<body></html>

Open qknight opened this issue 8 months ago • 3 comments

I want to use parse_document to create dom/vdom patches but the parse_document(...) keeps adding <html> and <body>. I wonder, is there an option to fine-tune the error correction level? I like that it does add a </title> in the example below.

But for creating a virtual-dom patch on a <div id="here"> it is bad to have to filter the html tags out afterwards.


/// parse none-escaped html strings as "Hello world!" into a node tree (see also raw_html(...))
pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    let dom: RcDom = parse_document(RcDom::default(), Default::default()).one(html);
    if let Some(body) = find_body(&dom.document) {
        let new_document = Rc::new(markup5ever_rcdom::Node {
            data: NodeData::Document,
            parent: Cell::new(None),
            children: body.children.clone(),
        });
        process_handle(&new_document)
    } else {
        Err(ParseError::NoBodyInParsedHtml)
    }
}

// Recursively find the <body> element
fn find_body(handle: &Handle) -> Option<Handle> {
    match &handle.data {
        NodeData::Element { name, .. } if name.local.as_ref() == "body" => Some(handle.clone()),
        _ => {
            for child in handle.children.borrow().iter() {
                if let Some(body) = find_body(child) {
                    return Some(body);
                }
            }
            None
        }
    }
}

However, my problem is that I also want to parse html with a <html>...</html> tag in it and then it gets removed.

html-driver.rs test

#[test]
fn from_utf8() {
    let dom = driver::parse_document(RcDom::default(), Default::default())
        .from_utf8()
        .one("<title>Test".as_bytes());
    let mut serialized = Vec::new();
    let document: SerializableHandle = dom.document.clone().into();
    serialize::serialize(&mut serialized, &document, Default::default()).unwrap();
    assert_eq!(
        String::from_utf8(serialized).unwrap().replace(' ', ""),
        "<html><head><title>Test</title></head><body></body></html>"
    );
}

Update:

parse_fragment is also adding unwanted html.

qknight avatar Mar 13 '25 03:03 qknight

You likely want parse_fragment

nicoburns avatar Mar 13 '25 03:03 nicoburns

@nicoburns thanks for the heads-up!

Using parse_fragment it still would add a <html> tag. bummer ;-)

I've implemented it like this:

pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
    let dom: RcDom = parse_fragment(RcDom::default(), Default::default(),
    QualName::new(None, ns!(html), local_name!("div")),
    vec![],
).one(html);
    process_handle(&dom.document)

and my test shows now:

#[test]
fn test_pre_code3() {
    let html = r#"<div><p> test </p><pre><code>
0
  1
  2
3
</code></pre>
</div>"#;
let expected = r#"<div><p> test </p><pre><code>
0
  1
  2
3
</code></pre><!--separator-->
</div>"#;

    let node: Node<()> = parse_html(html).ok().flatten().expect("must parse");
    //println!("node: {:#?}", node);
    println!("html: {}", html);
    println!("render: {}", node.render_to_string());
    assert_eq!(expected, node.render_to_string());
}

output:

---- test_pre_code3_paragraphs_mix stdout ----
--- <code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code> ---
html: <div><p> test </p><pre><code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code></pre>
</div>
render: <html><div><p> test </p><pre><code>
  0
  <p>1</p>
  2
<p>3</p>
  4
</code></pre><!--separator-->
</div></html>
thread 'test_pre_code3_paragraphs_mix' panicked at tests/html_parser_test.rs:130:5:
assertion `left == right` failed
  left: "<div><p> test </p><pre><code>\n  0\n  <p>1</p>\n  2\n<p>3</p>\n  4\n</code></pre><!--separator-->\n</div>"
 right: "<html><div><p> test </p><pre><code>\n  0\n  <p>1</p>\n  2\n<p>3</p>\n  4\n</code></pre><!--separator-->\n</div></html>"

qknight avatar Mar 14 '25 02:03 qknight

Updates:

Found the function which adds the <html>...</html> tags. I wonder if this can be made optional, maybe with a new TreeBuilderOpts argument.

html5ever/mod.rs

    pub fn new_for_fragment(
        sink: Sink,
        context_elem: Handle,
        form_elem: Option<Handle>,
        opts: TreeBuilderOpts,
    ) -> TreeBuilder<Handle, Sink> {
        println!("new_for_fragment");
...
        // https://html.spec.whatwg.org/multipage/#parsing-html-fragments
        // 5. Let root be a new html element with no attributes.
        // 6. Append the element root to the Document node created above.
        // 7. Set up the parser's stack of open elements so that it contains just the single element root.
        tb.create_root(vec![]);
        println!("new_for_fragment 3");

html5ever/tree_builder.rs

    //§ creating-and-inserting-nodes
    fn create_root(&self, attrs: Vec<Attribute>) {
        let elem = create_element(
            &self.sink,
            QualName::new(None, ns!(html), local_name!("html")),
            attrs,
        );
        self.push(&elem);
        self.sink.append(&self.doc_handle, AppendNode(elem));
        // FIXME: application cache selection algorithm
    }

qknight avatar Mar 14 '25 14:03 qknight

I'm hesitant to make this a configurable behaviour, since that takes us away from the HTML parsing specification that this crate implements. This seems like a post-processing step that would be better implemented by crates that use this library, using manual DOM operations on the parsed tree.

jdm avatar Apr 04 '25 20:04 jdm

@jdm I couldn't find a formal definition in the html5ever source code what a html fragment actually is.

For my intuition the function parse_fragment does two things in one: 1. parse and fix a section of html like <p>foo</p> or even <html>....</html> and 2. fix it into a standalone document.

This is the 'collective' intuition represented in chatgpt: https://chatgpt.com/share/67f525ed-c53c-800c-8f49-85b6064226e7 and TL;DR A fragment means a snippet of HTML meant to live inside an existing HTML element, not a standalone HTML document.

So I kindly ask you this: Could you please add a 'formal' comment on the parse_fragment implementation what it actually does, so that other developers like me don't get confused, too.

If you don't want to alter the implementation in any way, since you like it as it is, then just close this ticket. I can go with the hack you proposed. Thanks!

qknight avatar Apr 08 '25 13:04 qknight

https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments

In particular steps 7, 8 and 16.

jdm avatar Apr 08 '25 15:04 jdm

@jdm

Regarding Step 16:

Return root's children, in tree order.

Doesn't "return the root's children" imply that the root itself (<html> element) is not returned, only it's children?

nicoburns avatar Apr 08 '25 18:04 nicoburns

That makes me wonder if it's possible to parse a fragment that has multiple elements at the fragment root. What do you return in that case, if not the artificial root mode?

jdm avatar Apr 08 '25 18:04 jdm

Mmm, I see that users of that algorithm expect a list of direct children that they can iterate and append to some other tree. Interesting.

jdm avatar Apr 08 '25 18:04 jdm

I mean, one presumes this API is for parsing a https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment. Which is is it's own entity which provides a container for a list of children, but doesn't have the same enforced structure (<html><head><body>) as an actual Document. When appending a DocumentFragment to a node in a document, the DocumentFragment disappears and the list of children are appended instead.

nicoburns avatar Apr 08 '25 19:04 nicoburns

I was able to implement setInnerHTML using html5ever by working around this issue by manually moving the children of the created <html> node to the target node (and then deleting the <html> node) after parsing https://github.com/DioxusLabs/blitz/blob/b6bdfa7625030e9363e6773e0fb5a14cfe6b44b4/packages/blitz-html/src/html_sink.rs#L127

This works well enough, but we may want to consider a better API for this use case.

nicoburns avatar Sep 04 '25 22:09 nicoburns