HTMLReader
Elements after self-closing tag will be added as child of previous tag
Hi,
In my experiment, this HTML <td><div style="clear: both;" /><img src="abc.jpg" /></td> adds the img element as a child of the div instead of the td. Is this a bug?
Thanks for the awesome component!
Cheers, Joe
Hello! By my interpretation of the spec, I believe it is correct behaviour for the img element to end up a child of the div.
The brief explanation is that there are only a handful of tags that are considered self-closing in HTML, and div is not one of them, so your noble attempt to make a self-closing div is unfortunately for naught.
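To make "a handful" concrete, here is a minimal sketch of that check in Python. The set below is the void-element list as I understand it from the current HTML Living Standard (older drafts also listed a few more, such as keygen); the names VOID_ELEMENTS and is_void are my own, not anything from HTMLReader.

```python
# Void elements, as listed in the HTML Living Standard (assumed current list).
# These never have content and never take an end tag.
VOID_ELEMENTS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
})


def is_void(tag_name: str) -> bool:
    """Return True if tag_name is a void element (hypothetical helper)."""
    return tag_name.lower() in VOID_ELEMENTS


print(is_void("img"))  # True
print(is_void("div"))  # False
```

Only tags in that set behave the way a trailing / suggests; for everything else the slash changes nothing.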
The longer explanation involves diving into the spec, and I would not blame you one bit if you lose interest and decide to move along with your life instead of reading this overly verbose description of one person reading a very long spec.
With that warning, let's go! I'll first make the assumption that we are somewhere within a table element by the time we encounter the example snippet you've pasted above. That means our journey starts with the parser seeing a start tag named td and transitioning to the "in cell" insertion mode.
Now we're at <div style="clear: both;" />. The tokenizer turns this into a start tag named div with its self-closing flag set (and an attribute, but I'll ignore that because it doesn't affect the parsing here). The tokenizer is pretty dumb and doesn't know whether you need the self-closing flag or not, so it cheerfully sets the flag to allow the parser to decide what to do about it.
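Here is a rough sketch of that tokenizer step, assuming a toy regex instead of the spec's real state machine. It only understands start tags with double-quoted attributes, and StartTag and tokenize_start_tags are hypothetical names for illustration. The point is just that the flag is recorded for every tag, whether or not it will mean anything later.

```python
import re
from dataclasses import dataclass, field


@dataclass
class StartTag:
    name: str
    attributes: dict = field(default_factory=dict)
    self_closing: bool = False


# Toy patterns: a tag name, optional double-quoted attributes, optional "/".
_TAG = re.compile(
    r'<([A-Za-z][A-Za-z0-9]*)((?:\s+[^\s=/>]+(?:="[^"]*")?)*)\s*(/?)>'
)
_ATTR = re.compile(r'([^\s=/>]+)(?:="([^"]*)")?')


def tokenize_start_tags(html: str) -> list:
    """Return a StartTag for every start tag found; end tags are skipped."""
    tokens = []
    for match in _TAG.finditer(html):
        attrs = {k.lower(): v for k, v in _ATTR.findall(match.group(2))}
        # The flag is set purely from the syntax; no knowledge of the tag name.
        tokens.append(StartTag(match.group(1).lower(), attrs, match.group(3) == "/"))
    return tokens


snippet = '<td><div style="clear: both;" /><img src="abc.jpg" /></td>'
for token in tokenize_start_tags(snippet):
    print(token.name, token.self_closing)
```

Both the div and the img come out with the flag set; it is up to the tree construction stage to decide which of them gets to keep it.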
If you're following along with the "in cell" insertion mode in the spec, you'll find we reach the "anything else" case because nothing specifically deals with a div. That case says to process the token as if we're in the "in body" insertion mode.
Sooo we head over there and scroll down a lot and eventually we get to the rule for a start tag named one of: address, article, aside... div... with a list of a couple dozen applicable tags. It says to insert an HTML element, and most importantly for us in this case, there is no mention of acknowledging the self-closing flag. So it's as if that / never existed; it's utterly ignored, the div element is pushed onto the stack of open elements, and subsequent elements get inserted as children of the div.
For comparison, try scrolling down a few cases more to the one for "a start tag whose tag name is one of: area, br, embed, img, keygen, wbr". You'll see that the steps are to insert the element, immediately pop it back off the stack of open elements, and "acknowledge the token's self-closing flag, if it is set." If you make any of those tags self-closing, you'll find things work as you expect.
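The two rules can be sketched as a toy tree builder. This is a heavy simplification of the spec's "in body" handling, written only to show the stack behaviour; Node, build_tree, and the token tuples are hypothetical, and none of this is HTMLReader's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


# Start tags from the rule quoted above: inserted, then popped right away.
POPPED_IMMEDIATELY = {"area", "br", "embed", "img", "keygen", "wbr"}


def build_tree(tokens):
    """tokens: (kind, name, self_closing) tuples, where kind is "start" or "end"."""
    root = Node("#root")
    stack = [root]  # stack of open elements; the last entry is the current node
    for kind, name, self_closing in tokens:
        if kind == "start":
            node = Node(name)
            stack[-1].children.append(node)
            if name in POPPED_IMMEDIATELY:
                continue  # never stays open, with or without the flag
            # Everything else (div, td, ...) stays open; self_closing is ignored.
            stack.append(node)
        else:
            # End tag: close the nearest matching open element and anything above it.
            for i in range(len(stack) - 1, 0, -1):
                if stack[i].name == name:
                    del stack[i:]
                    break
    return root


tokens = [
    ("start", "td", False),
    ("start", "div", True),   # the "/" is recorded but has no effect
    ("start", "img", True),
    ("end", "td", False),
]
td = build_tree(tokens).children[0]
div = td.children[0]
print([child.name for child in td.children])   # ['div']
print([child.name for child in div.children])  # ['img']
```

Adding an explicit ("end", "div", False) token right after the div start tag, which corresponds to writing <div style="clear: both;"></div>, makes the img a child of the td instead.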
Does that make sense? Congrats on reading this far! Let me know if anything is unclear, or if I've misinterpreted anything.