html5ever

Malformed HTML parsed differently from browsers

Open: demurgos opened this issue 1 year ago • 1 comment

I have an HTML file with markup that can be reduced to the following:

<html>
<body>
<div id="div0">
  <a hr
</div>
<div id="div1">
  <div id="div2"></div>
  <div id="div3">
    <a href="/">bar</a>
  </div>
</div>
</body>
</html>

Notice the truncated <a tag on line 4 (caused by an HTML fragment accidentally truncated in the DB).

If I save this content to a file, load it in Firefox, and print the resulting DOM with document.getElementsByTagName("html")[0].outerHTML, Firefox returns:

<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </a><div id="div3"><a hr="" <="" div="">
    </a><a href="/">bar</a>
  </div>
</div>
</body></html>
  • The truncated link produces 3 <a> nodes in the DOM
  • The well-formed <a> tag with the text bar is still present in the output

However, if I parse the input with html5ever and print back the result, I get:

<html><head></head><body>
<div id="div0">
  <a hr="" <="" div="">
</a><div id="div1"><a hr="" <="" div="">
  <div id="div2"></div>
  </div>


</div></body></html>
  • The truncated link only appears twice
  • The well-formed link with bar completely disappeared!

EDIT: See the next message. There are still some differences, but the ones described here seem to be caused by the TreeSink impl I used, not by the parser.

This difference in interpretation between Firefox/Chrome and html5ever causes problems when I process these documents to recover their content. I'm well aware that the input is broken, but I would expect html5ever to produce the same structure as real browsers.


EDIT: Here is an even smaller repro. Removing the newline from it makes the mismatch disappear.

<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>

demurgos, Oct 01 '23 15:10

Running the arena example, I actually get a result close to real browsers.

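I added a Debug impl for Node to html5ever/examples/arena:

// Only the payload and the forward links (first child, next sibling) are
// printed, which is enough to reconstruct the tree shape from the dump.
impl<'arena> std::fmt::Debug for Node<'arena> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("Node")
            .field("data", &self.data)
            .field("first_child", &self.first_child)
            .field("next_sibling", &self.next_sibling)
            .finish()
    }
}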
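
The same pattern can be tried without an html5ever checkout, on a cut-down stand-in. This is only a sketch: the Node struct below is hypothetical (a string payload instead of the example's node data, and just three links), and demo_dump is a helper invented for illustration. It suggests why printing only data, first_child and next_sibling is a reasonable choice: those links only point forward, whereas following a back-link such as parent would make the Debug output recurse forever.

```rust
use std::cell::Cell;
use std::fmt;

// Hypothetical, cut-down stand-in for an arena node. The payload is a plain
// string and only three links are kept, so this is not the example's real type.
pub struct Node<'arena> {
    data: &'static str,
    parent: Cell<Option<&'arena Node<'arena>>>,
    first_child: Cell<Option<&'arena Node<'arena>>>,
    next_sibling: Cell<Option<&'arena Node<'arena>>>,
}

impl<'arena> Node<'arena> {
    pub fn new(data: &'static str) -> Self {
        Node {
            data,
            parent: Cell::new(None),
            first_child: Cell::new(None),
            next_sibling: Cell::new(None),
        }
    }
}

impl<'arena> fmt::Debug for Node<'arena> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Forward links only: also printing `parent` would bounce between
        // parent and child without ever terminating.
        f.debug_struct("Node")
            .field("data", &self.data)
            .field("first_child", &self.first_child)
            .field("next_sibling", &self.next_sibling)
            .finish()
    }
}

/// Builds the equivalent of `<div><a></a></div>` by hand (including the
/// child-to-parent back-link) and returns the Debug dump of the root.
pub fn demo_dump() -> String {
    let div = Node::new("div");
    let a = Node::new("a");
    div.first_child.set(Some(&a));
    a.parent.set(Some(&div));
    format!("{:?}", div)
}
```

Calling demo_dump() returns a nested Node { ... } dump in which the a node appears inside the first_child cell of the div node, and the parent link is never followed.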

And then executed:

$ cat ./malformed.html
<html><body><div><a hr</div><div><div></div>
<div><a href="/">bar</a></div></div></body></html>
$ cargo run --example arena < ./malformed.html

This produced a tree corresponding to:

<document>
  <html>
    <head></head>
    <body>
      <div>
        <a hr<="" div=""></a>
        <div>
          <a hr<="" div="">
            <div></div>
            "\n"
          </a>
          <div>
            <a hr<="" div=""></a>
            <a href="/">bar</a>
          </div>
        </div>
        "\n"
      </div>
    </body>
  </html>
</document>

The differences from real browsers are:

  • There is a <div> inside the second anchor, whereas that anchor is empty in browsers.
  • The broken anchors have two attributes instead of three.
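
The attribute count may depend on the newline rather than on the parser. The sketch below is a heavily simplified model of how the attribute-name states of the WHATWG HTML tokenizer treat this input, as far as I understand them: whitespace and / end the current attribute name, while a stray < is a parse error that still becomes part of a name. It is not html5ever's code. The function name attribute_names is made up for illustration, and the model ignores attribute values (=), quotes and duplicate names.

```rust
/// Simplified model of the attribute part of a start tag, covering only
/// valueless attributes. `rest` is the text between the tag name and the
/// closing `>` (both excluded), e.g. " hr</div" for the tag `<a hr</div>`.
pub fn attribute_names(rest: &str) -> Vec<String> {
    let mut names: Vec<String> = Vec::new();
    let mut in_name = false;
    for c in rest.chars() {
        match c {
            // Whitespace and `/` terminate the current attribute name
            // and are not part of any name themselves.
            '\t' | '\n' | '\x0C' | ' ' | '/' => in_name = false,
            // Everything else, including a stray `<`, either starts a new
            // attribute name or is appended to the current one.
            _ => {
                if !in_name {
                    names.push(String::new());
                    in_name = true;
                }
                names
                    .last_mut()
                    .expect("a name was pushed just above")
                    .push(c.to_ascii_lowercase());
            }
        }
    }
    names
}
```

Under this model, attribute_names(" hr\n</div") gives the three names hr, < and div (the newline variant from the first report), while attribute_names(" hr</div") gives the two names hr< and div (the single-line repro fed to the arena example). If the model is right, the two-versus-three count follows from the input used rather than from a parser disagreement.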

Regarding the other differences, they may be caused by my TreeSink. I'm using html5ever through scraper, so I'll check there too.

demurgos, Oct 01 '23 17:10