markup.ml icon indicating copy to clipboard operation
markup.ml copied to clipboard

Some invalid HTML documents yield Failure("require_current_element: None")

Open Tchou opened this issue 4 years ago • 3 comments

The following document (reduced as much as possible) makes parse_html raise a failure "require_current_element: None". Note that the document is indeed invalid since xmp elements are both deprecated and should not appear inside a td element but still one should be able to perform error recovery. I'm guessing the failure exception should not leak to the client ?

<!DOCTYPE html>
<html>
<head><title></title></head>
<body>
    <table>
       <tr>
         <td>
           <xmp></xmp>
         </td>
       </tr>
    </table>

    <table >
      <tr>
        <td></td>
      </tr>
    </table>
    <code></code>
</body>

</html>

I tested it with markup 1.0.2, using the following program :

open Markup

let stream, _close = file "test.html"

let  parser =  parse_html 
        ~report:(fun _ _ -> ())
        ~context:`Document stream
    
let _ = iter ignore (signals parser)

Tchou avatar Nov 25 '21 14:11 Tchou

I haven't looked into the error in detail yet, but, indeed, exceptions should not leak to the library user, and the parser should be able to recover at least somehow from all errors.

aantron avatar Nov 25 '21 16:11 aantron

You may be able to follow along these rules in the HTML5 spec to get an idea of what the parser is required by the spec to do (that's what I'll do when debugging). The input is quite small (thank you!), so it should be doable. If you can spot the point at which the parser diverges, that would probably be the bug.

The error makes me suspect that this is an interaction with the "adoption agency algorithm", called on by the spec in various places. It is used to reorganize the document tree in response to various misnested elements. Markup.ml creates local subtrees to "buffer" parser stream output in case those subtrees might be affected by the "adoption agency algorithm," and streams those subtrees out on its final output stream once it is sure that the subtree is safe from modification by the algorithm.

aantron avatar Nov 25 '21 16:11 aantron

If you do try to follow the spec, and run into the "foster parenting algorithm," which is required in misnesting inside <table>, Markup.ml does not implement that algorithm per the conformance status. This is because the "foster parenting algorithm" can require very remote modifications to the document tree being parsed, which, in a streaming parser, means buffering a potentially huge amount of signals before emitting them — defeating the point of streaming.

As an aside, it is possible to implement the algorithm as a separate pass over the signal stream for people that need this level of conformance (at the price of buffering the document), but I don't think that's ever been done for Markup.ml.

aantron avatar Nov 25 '21 16:11 aantron