html5ever
Malformed HTML parsed differently from browsers
I have an HTML file with markup that can be reduced to the following:
<html>
<body>
<div id="div0">
<a hr
</div>
<div id="div1">
<div id="div2"></div>
<div id="div3">
<a href="/">bar</a>
</div>
</div>
</body>
</html>
Notice the truncated <a tag on line 4 (caused by an HTML fragment accidentally truncated in the DB).
If I create a file with this content, load it in Firefox and print the resulting DOM with document.getElementsByTagName("html")[0].outerHTML, Firefox returns:
<html><head></head><body>
<div id="div0">
<a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
<div id="div2"></div>
</a><div id="div3"><a hr="" <="" div="">
</a><a href="/">bar</a>
</div>
</div>
</body></html>
- The truncated link results in 3 nodes in the DOM
- The well-formed link with text bar is still present in the output
However, if I parse the input with html5ever and print back the result, I get:
<html><head></head><body>
<div id="div0">
<a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
<div id="div2"></div>
</div>
</div></body></html>
- The truncated link only appears twice
- The well-formed link with text bar completely disappeared!
EDIT: See next message, there are still some differences but the ones here seem to be caused by the TreeSink impl I used, not the parser.
This difference in interpretation between Firefox/Chrome and html5ever is causing me issues when processing these documents to recover them. I'm well aware that the input is broken, but I would expect html5ever to produce the same structure as real browsers.
EDIT: Even smaller repro, removing the newline fixes the mismatch.
<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>
Running the arena example, I actually get a result close to real browsers.
I added a Debug impl for Node to html5ever/examples/arena:
impl<'arena> std::fmt::Debug for Node<'arena> {
    // Print only the data and the child/sibling links, enough to see the tree shape.
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("Node")
            .field("data", &self.data)
            .field("first_child", &self.first_child)
            .field("next_sibling", &self.next_sibling)
            .finish()
    }
}
And then executed:
$ cat ./malformed.html
<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>
$ cargo run --example arena < ./malformed.html
This produced a tree corresponding to:
<document>
  <html>
    <head></head>
    <body>
      <div>
        <a hr<="" div=""></a>
        <div>
          <a hr<="" div="">
            <div></div>
            "\n"
          </a>
          <div>
            <a hr<="" div=""></a>
            <a href="/">bar</a>
          </div>
        </div>
        "\n"
      </div>
    </body>
  </html>
</document>
The differences from real browsers are:
- there is a <div> inside the second anchor, while that anchor is empty in browsers
- the broken anchors have two attributes instead of three
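On the attribute count, one possible explanation (an assumption on my side, not something verified against html5ever's code) is the whitespace after hr. In the WHATWG tokenizer rules, whitespace or / ends the current attribute name, while < is a parse error that is still appended to the name. The first input has a newline between hr and </div>, the smaller repro does not. The sketch below is a simplified stand-in for those rules; attribute_names is a hypothetical helper, and attribute values and EOF handling are left out.

```rust
/// Collects attribute names from the text that follows a start tag's name,
/// e.g. " hr\n</div>" for the input "<a hr\n</div>".
/// Simplified sketch of the WHATWG attribute-name rules, NOT html5ever code:
/// attribute values ('=') and end-of-file handling are out of scope.
fn attribute_names(after_tag_name: &str) -> Vec<String> {
    let mut names: Vec<String> = Vec::new();
    let mut current = String::new();
    for c in after_tag_name.chars() {
        match c {
            // '>' closes the tag, so tokenizing of attributes stops here.
            '>' => break,
            // Whitespace and '/' end the attribute name being built.
            // A '/' that is not followed by '>' is a parse error, after which
            // the tokenizer simply continues with the next attribute.
            ' ' | '\t' | '\n' | '\x0C' | '/' => {
                if !current.is_empty() {
                    names.push(std::mem::take(&mut current));
                }
            }
            // Anything else is appended to the name. This includes '<', which
            // is a parse error but still becomes part of the attribute name.
            other => current.push(other.to_ascii_lowercase()),
        }
    }
    // Flush a name that was still being built when the tag ended.
    if !current.is_empty() {
        names.push(current);
    }
    names
}
```

With this sketch, attribute_names(" hr\n</div>") gives the three names hr, < and div, while attribute_names(" hr</div>") gives the two names hr< and div, which would match the two outputs shown in this issue.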
Regarding the other differences, they may be caused by my TreeSink; I'm using html5ever through scraper, so I'll check there too.