lol-html icon indicating copy to clipboard operation
lol-html copied to clipboard

Implied End Tags not Handled Correctly

Open mitsuhiko opened this issue 2 years ago • 7 comments

While working on #109 I noticed that one can create a bit of an odd situation when trying to rewrite tags in the presence of optional end tags.

Take this list:

<ul>
  <li>a
  <li>b
</ul>

When I select li and change the tag to div I end up with the following output:

<ul>
  <div>a
  <div>b
</div>

This tests shows the issue


#[test]
fn optional_end_tags() {
    let output = rewrite_element(b"<ul><li>a<li>b</ul>", UTF_8, "li", |el| {
        el.set_tag_name("div").unwrap();
    });
    assert_eq!(output, "<ul><div>a<div>b</ul>");
}

I believe the system would need to emit "virtual" end tags for when implicitly closed tags are encountered. I believe the system also does not handle this tag stack correctly otherwise which probably should confuse the css selector as well.

mitsuhiko avatar Nov 29 '21 15:11 mitsuhiko

So one hypothetical way to deal with this would be to add an implicit flag to EndTag. That flag would be true if while parsing a virtual end tag is created. Upon serializing, the tag is skipped on serializing if it is still valid to implicitly close it where it is. So the case above hidden this is actually being created:

<ul>
  <li>a
  </li><li>b
</li></ul>

The closing </li> tags are implicit. They would not be serialized normally. However if the conditions of the context in which it exists change (for instance because it in itself turns into a div as an example or because additional data is added after the element that would invalidate the implicit end tag conditions) the serializer would start emitting it.

I think creating the EndTag token at all times is preferrable because it makes on_end_tag and others much more useful without having to deal with bizarre corner cases that unfortunately exist in the real world.

mitsuhiko avatar Nov 29 '21 19:11 mitsuhiko

I have been looking into where to best address this. The selectors VM already maintains a stack of currently open elements. This means that in order for implicitly self closing elements not to confuse the selector VM the VM needs to be updated accordingly. As far as I can see the VM is entirely fed from the TransformController so in theory if one were to feed these virtual tags in there stuff should work I guess.

However I'm entirely unsure how the system works. In particular there is this tag sink and tree builder system internally which I'm not at all familiar with so I'm a bit stuck here with the limited time I have to spend on this.

If someone were to point me at the right place to add this logic I'm happy to provide a patch.

mitsuhiko avatar Nov 29 '21 20:11 mitsuhiko

So my current understanding is that tokens are effectively only produced via the TokenCapturer in Dispatcher::try_produce_token_from_lexeme. This in turn invokes TransformController::handle_token.

The main TransformController is the HtmlRewriteController. This is also the one who currently owns the SelectorMatchingVm which in turn maintains the open tag stack. So in theory the most logical place to emit these implied end tags were somewhere in there. For instance HtmlRewriteController::handle_start_tag and HtmlRewriteController::handle_end_tag could be amended to automatically issue such tags. The challenge here in part might be that if there were other implementations of TransformController (there are only for tracing and tests currently) they would have to replicate this logic.

The other challenge I see is that one effectively would need to either move the stack out of the SelectorMatchingVm, maintain a secondary stack or "abuse" the VM to inspect the tag state.

Going by the HTML5 spec the actual logic of the implicit end tag handling is specified in "13.2.6.3":

When the steps below require the UA to generate implied end tags, then, while the current node is a dd element, a dt element, an li element, an optgroup element, an option element, a p element, an rb element, an rp element, an rt element, or an rtc element, the UA must pop the current node off the stack of open elements.

If a step requires the UA to generate implied end tags but lists an element to exclude from the process, then the UA must perform the above steps as if that element was not in the above list.

These "generate implied end tags" callouts are happening on a few start tags (like li where then li is excluded) but also quite a few end tags but the exact algorithm is slightly different for those cases. I wonder if it still can be handled somewhat the same on end tag or if it would be necessary to adhere to the spec. In particular one would technically have to know in which of the many insertion modes one is.

The quickest hack were I believe to expose the token stack from the SelectorMatchingVm and to handle these implied tags in the VM. However then it still leaves the question when one wants to "upgrade" such an implied tag into a non implied tag if the tag name is modified or the surroundings are being changed (eg: by "appending" new HTML after the tag).

mitsuhiko avatar Nov 29 '21 21:11 mitsuhiko

Generating implied end tags is in general can't be done on streams and requires full AST. lol-html was never designed to operate on ASTs, rather it provides a simplistic "token tree" representation of the markup. Also, it goes against lol-html paradigm of rewriting/modifying only the content specified by selectors.

There are lots of other corner cases in HTML that we can hit once we step into implicit tag nesting territory, so it's better stick with the current paradigm of the "token tree". If you need something more advanced and streaming/memory consumption is not a concern I would suggest to use solutions that perform full HTML parsing (e.g. html5ever)

In your example I see the following:

<ul>
  <div>a
  <div>b
</div>

is </div> a typo or it was indeed the output?

inikulin avatar Dec 02 '21 02:12 inikulin

Generating implied end tags is in general can't be done on streams and requires full AST.

I take your word for it, but I'm not entirely sure why that is true. The system already needs to maintain some sense of tag stack and that stack should be sufficient to figure out if there are tags that need to close and such events can then be emitted as events.

Also, it goes against lol-html paradigm of rewriting/modifying only the content specified by selectors.

The issue is that on the one hand the selectors are wrong if the end tags are missing, on the other that there is an API to change the tag name which is not safe to use on tags which might have implied end tags.

is </div> a typo or it was indeed the output?

That is the output. I'm assuming it thinks that the </ul> end tag belongs to the innermost </li> and rewrites it to </div>.

mitsuhiko avatar Dec 02 '21 16:12 mitsuhiko

That is the output. I'm assuming it thinks that the end tag belongs to the innermost and rewrites it to .

ok, then this is not good.

The system already needs to maintain some sense of tag stack and that stack should be sufficient to figure out if there are tags that need to close and such events can then be emitted as events.

We wanted to keep "token tree" construction rules as simple as possible - it's much easier to explain/document simple nesting rules, also we need to do as little implicit modifications as possible due to the security considerations. Partial tree construction rules implementation (implied end tags in this case) is a slippery slope. It still will have lots of corner cases where it will do the wrong thing as we can't always have proper stack as spec requires us doing streaming parsing (see infamous Adoption Agency Algorithm as a prominent example).

With all that being said, considering how the rewriting is broken according to your example, I'm afraid we don't have any other choice than to try it. We shouldn't emit implied end tags as real tokens, instead let's just popup elements from the selector's vm stack according to the rules. The performance needs to be considered and we would probably need some new optimisations

inikulin avatar Dec 03 '21 09:12 inikulin

I agree that the stream based approach is unlikely to be able to cover all cases so I think this will largely come down to "how likely is certain type of HTML in the real world".

In my particular case I started running into these cases as implied end tags particularly around lists, tables and paragraphs are super common and I needed to respond to the "end of tag" situation myself. So just fixing the VM's stack won't be enough as currently the API does not provide a way to respond to the end of a tag which however is necessary in a few situations. And to some degree this has been recognized now with the introduction of #109.

mitsuhiko avatar Dec 04 '21 13:12 mitsuhiko