html5ever

Add a method in TreeSink trait which will be called when an element ends

Open nearsyh opened this issue 10 years ago • 12 comments

For example, given HTML code like

<div>
<!-- some code -->
</div>

The function will be called when the parser reaches the end tag.

nearsyh avatar Jul 13 '15 06:07 nearsyh

I’ll have to look at the spec, but I’m not sure that’s well-defined given all of HTML’s error recovery (e.g. optional closing tags). What’s the use case?

SimonSapin avatar Jul 13 '15 06:07 SimonSapin

This may not be necessary, but I think it makes some work easier. For example,

<div>
  <h1>test</h1>
  <h2>test</h2>
  test
</div>

Adding this function can make it easy to gather all text in the outermost div.

nearsyh avatar Jul 13 '15 07:07 nearsyh

You seem to be trying to do event-based parsing. Given how the tree builder can call append_before_sibling, remove_from_parent, or reparent_children, I’m not sure that really works. (You might get incorrect results.) You may have to collect the nodes in a tree data structure like https://github.com/SimonSapin/kuchiki before you can process them.

SimonSapin avatar Jul 13 '15 07:07 SimonSapin
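
A minimal sketch of the "collect into a tree first, then process" approach suggested above, applied to the text-gathering use case. The `Tree` and `Node` types and the `build_example` helper are hypothetical and written only for illustration; they are not html5ever's or kuchiki's API. The tree is filled in by hand here, where a real program would let the parser populate it.

```rust
/// A node in a toy arena-backed tree (hypothetical types, for illustration only).
#[allow(dead_code)]
#[derive(Debug)]
enum Node {
    Element { name: String, children: Vec<usize> },
    Text(String),
}

/// Nodes live in a flat vector and refer to each other by index.
#[derive(Default)]
struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Create a detached element and return its index.
    fn element(&mut self, name: &str) -> usize {
        self.nodes.push(Node::Element {
            name: name.to_string(),
            children: Vec::new(),
        });
        self.nodes.len() - 1
    }

    /// Create a detached text node and return its index.
    fn text(&mut self, content: &str) -> usize {
        self.nodes.push(Node::Text(content.to_string()));
        self.nodes.len() - 1
    }

    /// Append `child` as the last child of `parent` (no-op if `parent` is text).
    fn append(&mut self, parent: usize, child: usize) {
        if let Node::Element { children, .. } = &mut self.nodes[parent] {
            children.push(child);
        }
    }

    /// Concatenate all descendant text in document order.
    fn text_content(&self, id: usize) -> String {
        match &self.nodes[id] {
            Node::Text(content) => content.clone(),
            Node::Element { children, .. } => {
                children.iter().map(|&c| self.text_content(c)).collect()
            }
        }
    }
}

/// Build the div/h1/h2 example from the thread by hand (whitespace omitted).
fn build_example() -> (Tree, usize) {
    let mut tree = Tree::default();
    let div = tree.element("div");
    for heading in ["h1", "h2"] {
        let el = tree.element(heading);
        let txt = tree.text("test");
        tree.append(el, txt);
        tree.append(div, el);
    }
    let tail = tree.text("test");
    tree.append(div, tail);
    (tree, div)
}
```

Because `text_content` runs only after the tree is complete, any moves the tree builder made while parsing have already been applied by the time the text is read.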

I suspect @hsivonen and @sideshowbarker have ideas about such apis.

Ms2ger avatar Jul 13 '15 08:07 Ms2ger

I’m guessing that @nearsyh would like to have a "streaming" HTML parser, like SAX and StAX do for XML. So the question is, given the adoption agency algorithm and friends, is it possible to parse HTML incrementally by buffering less than the entire document? (Where parts of the document can be considered "done" and are not modified again.)

SimonSapin avatar Jul 13 '15 13:07 SimonSapin

It’s possible to make a buffered SAX API. @hsivonen wrote one for use with the htmlparser he made (the same parser whose source is also used by Gecko as its HTML parser). The sources for that SAX API are at http://hg.mozilla.org/projects/htmlparser/file/default/src/nu/validator/saxtree

However, that API buffers the entire HTML document it parses; it doesn’t use any strategy to buffer less than the entire document, in the way described in https://github.com/servo/html5ever/issues/149#issuecomment-120936252.

I don’t actually know how practical it would be to try to implement a SAX/SAX-like event-based spec-conforming HTML-parsing API that buffered less than an entire document at one time. I think you could get some of it just by buffering all tables, but beyond that I don’t know what the other partial-buffering strategies would be. But I’m certain @hsivonen could give some insight on it.

(BTW, while the buffered mode is the default for @hsivonen’s SAX API, it also provides a fully-streaming (non-buffered) mode as an option. That’s actually the mode which the validator.nu code uses. However, in that mode, any markup it runs into that would require non-streaming parsing behavior—i.e., adoption agency algorithm and friends—causes a non-recoverable parse error.)

sideshowbarker avatar Jul 13 '15 16:07 sideshowbarker
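
To make the "streaming mode with fatal errors" trade-off concrete, here is a deliberately oversimplified sketch. The `Token`, `Event`, and `FatalError` types and the `stream_strict` function are hypothetical and are not taken from htmlparser, validator.nu, or html5ever. As a loud simplification, the sketch treats every mis-nested end tag as fatal; a real HTML parser recovers from many such cases without leaving streaming mode and only has to give up where the spec requires changing nodes that were already emitted.

```rust
/// Input tokens for the toy emitter (hypothetical, far simpler than real HTML tokens).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Start(String),
    End(String),
    Text(String),
}

/// SAX-like events handed to the consumer as soon as they are known.
#[derive(Debug, PartialEq)]
enum Event {
    StartElement(String),
    EndElement(String),
    Characters(String),
}

/// Raised when the input would need already-emitted events to be revised.
#[derive(Debug, PartialEq)]
struct FatalError {
    unexpected_end: String,
}

/// Emit events without buffering; stop at the first mis-nested end tag.
fn stream_strict(tokens: &[Token], mut emit: impl FnMut(Event)) -> Result<(), FatalError> {
    // Names of the elements that are currently open, innermost last.
    let mut open: Vec<&str> = Vec::new();
    for token in tokens {
        match token {
            Token::Start(name) => {
                open.push(name);
                emit(Event::StartElement(name.clone()));
            }
            Token::Text(text) => emit(Event::Characters(text.clone())),
            Token::End(name) => {
                if open.last() == Some(&name.as_str()) {
                    open.pop();
                    emit(Event::EndElement(name.clone()));
                } else {
                    // Earlier events cannot be taken back, so give up here.
                    return Err(FatalError {
                        unexpected_end: name.clone(),
                    });
                }
            }
        }
    }
    Ok(())
}
```

Feeding this the tokens for `<b><i>x</b></i>` emits three events and then fails at `</b>`. A buffered implementation would instead hold events back until it knows they can no longer change.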

See also https://github.com/inikulin/parse5/issues/26#issuecomment-113298544

sideshowbarker avatar Jul 14 '15 01:07 sideshowbarker

See http://krijnhoetmer.nl/irc-logs/whatwg/20150714#l-117 for a discussion I had with @gsnedders (one of the html5lib devs) about this.

It seems the sad reality is that, as he notes there, the only way to do a streaming API for spec-conformant HTML parsing is to either buffer everything or admit fatal errors for cases that require non-streaming behavior.

sideshowbarker avatar Jul 14 '15 02:07 sideshowbarker

My conclusion is that a conforming "streaming" (SAX-like) HTML parser is only doable with trade-offs that are not worth it. (Either buffering up to the entire document, or introducing fatal errors.)

@nearsyh, could you confirm that’s what you were trying to do?

SimonSapin avatar Jul 14 '15 13:07 SimonSapin

@nox how does TreeSink::pop relate to this?

SimonSapin avatar May 03 '17 07:05 SimonSapin

I am not sure we properly call pop in all circumstances, but I guess it could be piggybacked for this feature.

nox avatar May 03 '17 07:05 nox

> I am not sure we properly call pop in all circumstances, but I guess it could be piggybacked for this feature.

I also would like some way to tell when elements end and was hoping pop would help. Would it be straightforward to call it in all circumstances or are there blockers to that?

max-heller avatar Jun 26 '24 22:06 max-heller
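
A minimal sketch of how a pop-style callback could serve the original use case, under the assumption that the hook fires for every element that leaves the stack of open elements (the open question in this thread). The `ElementHooks` trait, `DivTextCollector`, and `replay_example` are hypothetical names invented for illustration; this is not html5ever's `TreeSink` trait, and the calls are replayed by hand, not driven by a parser.

```rust
/// Hypothetical callbacks a tree builder might invoke (not html5ever's TreeSink).
trait ElementHooks {
    fn element_opened(&mut self, name: &str);
    fn characters(&mut self, text: &str);
    /// Assumed to be called every time an element leaves the open-elements stack.
    fn element_popped(&mut self, name: &str);
}

/// Gathers all text inside each outermost <div>.
#[derive(Default)]
struct DivTextCollector {
    div_depth: usize,
    buffer: String,
    finished: Vec<String>,
}

impl ElementHooks for DivTextCollector {
    fn element_opened(&mut self, name: &str) {
        if name == "div" {
            self.div_depth += 1;
        }
    }

    fn characters(&mut self, text: &str) {
        // Only text inside at least one open div is of interest.
        if self.div_depth > 0 {
            self.buffer.push_str(text);
        }
    }

    fn element_popped(&mut self, name: &str) {
        if name == "div" && self.div_depth > 0 {
            self.div_depth -= 1;
            if self.div_depth == 0 {
                // The outermost div just ended: hand over its text.
                self.finished.push(std::mem::take(&mut self.buffer));
            }
        }
    }
}

/// Replay, by hand, the calls the div/h1/h2 example from the thread would produce
/// (whitespace text omitted).
fn replay_example(sink: &mut impl ElementHooks) {
    sink.element_opened("div");
    for heading in ["h1", "h2"] {
        sink.element_opened(heading);
        sink.characters("test");
        sink.element_popped(heading);
    }
    sink.characters("test");
    sink.element_popped("div");
}
```

If the hook were skipped for some elements, `div_depth` would never return to zero and the text would never be handed over, which is why the "in all circumstances" question matters. The caveat raised earlier in the thread also still applies: nodes the tree builder moves after the callbacks have fired would not be reflected in the collected text.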