html5ever
html5ever copied to clipboard
parse_html adds unwanted tags like <html><head>...<body></html>
I want to use parse_document to create dom/vdom patches but the parse_document(...) keeps adding <html> and <body>. I wonder, is there an option to fine-tune the error correction level? I like that it does add a </title> in the example below.
But for creating a virtual-dom patch on a <div id="here"> it is bad to have to filter the html tags out afterwards.
/// parse none-escaped html strings as "Hello world!" into a node tree (see also raw_html(...))
pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
let dom: RcDom = parse_document(RcDom::default(), Default::default()).one(html);
if let Some(body) = find_body(&dom.document) {
let new_document = Rc::new(markup5ever_rcdom::Node {
data: NodeData::Document,
parent: Cell::new(None),
children: body.children.clone(),
});
process_handle(&new_document)
} else {
Err(ParseError::NoBodyInParsedHtml)
}
}
// Recursively find the <body> element
fn find_body(handle: &Handle) -> Option<Handle> {
match &handle.data {
NodeData::Element { name, .. } if name.local.as_ref() == "body" => Some(handle.clone()),
_ => {
for child in handle.children.borrow().iter() {
if let Some(body) = find_body(child) {
return Some(body);
}
}
None
}
}
}
However, my problem is that I also want to parse html with a <html>...</html> tag in it and then it gets removed.
html-driver.rs test
#[test]
fn from_utf8() {
let dom = driver::parse_document(RcDom::default(), Default::default())
.from_utf8()
.one("<title>Test".as_bytes());
let mut serialized = Vec::new();
let document: SerializableHandle = dom.document.clone().into();
serialize::serialize(&mut serialized, &document, Default::default()).unwrap();
assert_eq!(
String::from_utf8(serialized).unwrap().replace(' ', ""),
"<html><head><title>Test</title></head><body></body></html>"
);
}
Update:
parse_fragment is also adding unwanted html.
You likely want parse_fragment
@nicoburns thanks for the heads-up!
Using parse_fragment it still would add a <html> tag. bummer ;-)
I've implemented it like this:
pub fn parse_html<MSG>(html: &str) -> Result<Option<Node<MSG>>, ParseError> {
let dom: RcDom = parse_fragment(RcDom::default(), Default::default(),
QualName::new(None, ns!(html), local_name!("div")),
vec![],
).one(html);
process_handle(&dom.document)
and my test shows now:
#[test]
fn test_pre_code3() {
let html = r#"<div><p> test </p><pre><code>
0
1
2
3
</code></pre>
</div>"#;
let expected = r#"<div><p> test </p><pre><code>
0
1
2
3
</code></pre><!--separator-->
</div>"#;
let node: Node<()> = parse_html(html).ok().flatten().expect("must parse");
//println!("node: {:#?}", node);
println!("html: {}", html);
println!("render: {}", node.render_to_string());
assert_eq!(expected, node.render_to_string());
}
output:
---- test_pre_code3_paragraphs_mix stdout ----
--- <code>
0
<p>1</p>
2
<p>3</p>
4
</code> ---
html: <div><p> test </p><pre><code>
0
<p>1</p>
2
<p>3</p>
4
</code></pre>
</div>
render: <html><div><p> test </p><pre><code>
0
<p>1</p>
2
<p>3</p>
4
</code></pre><!--separator-->
</div></html>
thread 'test_pre_code3_paragraphs_mix' panicked at tests/html_parser_test.rs:130:5:
assertion `left == right` failed
left: "<div><p> test </p><pre><code>\n 0\n <p>1</p>\n 2\n<p>3</p>\n 4\n</code></pre><!--separator-->\n</div>"
right: "<html><div><p> test </p><pre><code>\n 0\n <p>1</p>\n 2\n<p>3</p>\n 4\n</code></pre><!--separator-->\n</div></html>"
Updates:
Found the function which adds the <html>...</html> tags. I wonder if this can be made optional, maybe with a new TreeBuilderOpts argument.
html5ever/mod.rs
pub fn new_for_fragment(
sink: Sink,
context_elem: Handle,
form_elem: Option<Handle>,
opts: TreeBuilderOpts,
) -> TreeBuilder<Handle, Sink> {
println!("new_for_fragment");
...
// https://html.spec.whatwg.org/multipage/#parsing-html-fragments
// 5. Let root be a new html element with no attributes.
// 6. Append the element root to the Document node created above.
// 7. Set up the parser's stack of open elements so that it contains just the single element root.
tb.create_root(vec![]);
println!("new_for_fragment 3");
html5ever/tree_builder.rs
//§ creating-and-inserting-nodes
fn create_root(&self, attrs: Vec<Attribute>) {
let elem = create_element(
&self.sink,
QualName::new(None, ns!(html), local_name!("html")),
attrs,
);
self.push(&elem);
self.sink.append(&self.doc_handle, AppendNode(elem));
// FIXME: application cache selection algorithm
}
I'm hesitant to make this a configurable behaviour, since that takes us away from the HTML parsing specification that this crate implements. This seems like a post-processing step that would be better implemented by crates that use this library, using manual DOM operations on the parsed tree.
@jdm I couldn't find a formal definition in the html5ever source code what a html fragment actually is.
For my intuition the function parse_fragment does two things in one: 1. parse and fix a section of html like <p>foo</p> or even <html>....</html> and 2. fix it into a standalone document.
This is the 'collective' intuition represented in chatgpt: https://chatgpt.com/share/67f525ed-c53c-800c-8f49-85b6064226e7 and TL;DR A fragment means a snippet of HTML meant to live inside an existing HTML element, not a standalone HTML document.
So I kindly ask you this: Could you please add a 'formal' comment on the parse_fragment implementation what it actually does, so that other developers like me don't get confused, too.
If you don't want to alter the implementation in any way, since you like it as it is, then just close this ticket. I can go with the hack you proposed. Thanks!
https://html.spec.whatwg.org/multipage/parsing.html#parsing-html-fragments
In particular steps 7, 8 and 16.
@jdm
Regarding Step 16:
Return root's children, in tree order.
Doesn't "return the root's children" imply that the root itself (<html> element) is not returned, only it's children?
That makes me wonder if it's possible to parse a fragment that has multiple elements at the fragment root. What do you return in that case, if not the artificial root mode?
Mmm, I see that users of that algorithm expect a list of direct children that they can iterate and append to some other tree. Interesting.
I mean, one presumes this API is for parsing a https://developer.mozilla.org/en-US/docs/Web/API/DocumentFragment. Which is is it's own entity which provides a container for a list of children, but doesn't have the same enforced structure (<html><head><body>) as an actual Document. When appending a DocumentFragment to a node in a document, the DocumentFragment disappears and the list of children are appended instead.
I was able to implement setInnerHTML using html5ever by working around this issue by manually moving the children of the created <html> node to the target node (and then deleting the <html> node) after parsing https://github.com/DioxusLabs/blitz/blob/b6bdfa7625030e9363e6773e0fb5a14cfe6b44b4/packages/blitz-html/src/html_sink.rs#L127
This works well enough, but we may want to consider a better API for this use case.