html5-php
Invalid parsing result when head/body tag is missing
Consider this:
<html>Hello, This is a test.<br />Does it work this time?</html>
IMO, this is valid HTML, and DOMDocument parses it correctly. However, the HTML5 parser ignores the first line of text. We're using the loadHTML() method.
Even this one works with DOMDocument:
Hello, This is a test.<br />Does it work this time?
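For anyone who wants to reproduce the DOMDocument side of this comparison, here is a minimal sketch. It only uses PHP's built-in DOM extension; the variable names are made up for illustration, and the exact serialized output may differ between libxml2 versions, so it only checks that the leading text survives.

```php
<?php
// Sketch: parse the two inputs from this report with PHP's built-in DOMDocument
// and check that the leading text is kept in the resulting tree.
libxml_use_internal_errors(true); // keep libxml parse warnings out of the output

$inputs = [
    '<html>Hello, This is a test.<br />Does it work this time?</html>',
    'Hello, This is a test.<br />Does it work this time?',
];

$texts = [];
foreach ($inputs as $html) {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    // textContent concatenates all text nodes below the root element.
    $texts[] = $doc->documentElement->textContent;
    echo $doc->saveHTML(), "\n";
}
```

If the first sentence shows up in both serialized documents, DOMDocument is handling the omitted head/body tags the way described above.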
According to Mozilla documentation:
- html: The start tag may be omitted if the first thing inside the element is not a comment.
- body: The start tag may be omitted if the first thing inside it is not a space character, comment,
Reference: https://github.com/roundcube/roundcubemail/issues/6713#issuecomment-480320339
Can you post the references to the Mozilla documentation about this?
https://html5.validator.nu/ says it is not valid.
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/html https://developer.mozilla.org/en-US/docs/Web/HTML/Element/body
I tried https://validator.w3.org, and it also returns an error, but the error looks strange to me: "Element head is missing a required instance of child element title", while there's no head at all.
Official HTML5.2 documentation says:
- A head element’s start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.
- A body element’s start tag may be omitted if the element is empty, or if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.
Can you please post the exact references to the documentation instead of the main links? It is really hard to find the sentences you are referring to.
Look for "Tag omission".
The document you posted refers to the latest HTML 5.2 specs. This library implements most of the 5.0 specs. However, I see that starting to adopt some of the more recent specifications is a good idea, so if you wish to fix this behavior, PRs are welcome.
The old HTML5 documentation is the same in this context:
https://www.w3.org/TR/2014/REC-html5-20141028/semantics.html#the-html-element https://www.w3.org/TR/2014/REC-html5-20141028/sections.html#the-body-element https://www.w3.org/TR/2014/REC-html5-20141028/dom.html#element-dfn-tag-omission
Also, don't miss the fact that DOMDocument parses these correctly.
Good to know
Well, DOMDocument does not follow the HTML5 logic that closely; internally it is just a relaxed XML parser. DOMDocument is not really aware of the HTML5 specs.
Yeah, the main reason we switched from DOMDocument to this lib was to get better results. And in many cases the result is better, but this case obviously looks like a bug. Such "dummy" HTML code is not that uncommon in the email world.
Parsing such chunks of HTML would be useful when dealing with some ajax responses containing partials when scraping the web.
Hi, adding another test case to this issue:
Using the native DOMDocument::loadHTML() implementation:
$doc = new DOMDocument();
$doc->loadHTML('<title>Foo');
echo $doc->saveHTML();
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Foo</title></head></html>
Using this library's implementation:
use Masterminds\HTML5;
$parser = new HTML5(['disable_html_ns' => true]);
$doc = $parser->loadHTML('<title>Foo');
echo $doc->saveHTML();
<html><title>Foo</title></html>
Also, citing the spec for tag omission:
Omitting an element's start tag in the situations described below does not mean the element is not present; it is implied, but it is still there. For example, an HTML document always has a root html element, even if the string doesn't appear anywhere in the markup.
This implies that:
- The result of evaluating `$document->documentElement->tagName` should always be the string `html`
- The result of evaluating `(new DOMXPath($document))->query('/html/head')->item(0)->tagName` should always be the string `head`
- The result of evaluating `(new DOMXPath($document))->query('/html/body')->item(0)->tagName` should always be the string `body`
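The three checks above can be wrapped into a small helper to make them easy to run against any parser's output. This is only a sketch: `impliedElementsReport` is a hypothetical name, and it is exercised here with the built-in DOMDocument so it runs without extra dependencies. The unprefixed XPath queries assume a document without the XHTML namespace (for this library, that would mean parsing with `disable_html_ns` enabled).

```php
<?php
// Hypothetical helper (not part of any library): reports which of the
// implied elements are actually present in a parsed document.
function impliedElementsReport(DOMDocument $document): array
{
    $xpath = new DOMXPath($document);
    $root = $document->documentElement;

    return [
        'html' => $root !== null && $root->tagName === 'html',
        'head' => $xpath->query('/html/head')->length > 0,
        'body' => $xpath->query('/html/body')->length > 0,
    ];
}

libxml_use_internal_errors(true); // keep libxml parse warnings out of the output

// A document that spells out every tag explicitly.
$complete = new DOMDocument();
$complete->loadHTML('<html><head><title>Foo</title></head><body><p>Bar</p></body></html>');
$completeReport = impliedElementsReport($complete);

// The test case from this comment, with all optional tags omitted.
$partial = new DOMDocument();
$partial->loadHTML('<title>Foo');
$partialReport = impliedElementsReport($partial);

var_dump($completeReport, $partialReport);
```

Judging by the DOMDocument output quoted earlier in this comment, the native parser itself would likely be reported as missing `body` for `<title>Foo`, so the check is stricter than what DOMDocument does today.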
@ju1ius that is a valid point, see my comment https://github.com/Masterminds/html5-php/pull/182#issuecomment-632046708 for a possible solution
Using this library's implementation:
$parser = new HTML5(['disable_html_ns' => true]);
$doc = $parser->loadHTML('<title>Foo');
echo $doc->saveHTML();
<html><title>Foo</title></html>
<html><title>Foo</title></html> is valid.
Also, citing the spec for tag omission:
Omitting an element's start tag in the situations described below does not mean the element is not present; it is implied, but it is still there. For example, an HTML document always has a root html element, even if the string doesn't appear anywhere in the markup.
This implies that:
1. The result of evaluating `$document->documentElement->tagName` should always be the string `html`
2. The result of evaluating `(new DOMXPath($document))->query('/html/head')->item(0)->tagName` should always be the string `head`
3. The result of evaluating `(new DOMXPath($document))->query('/html/body')->item(0)->tagName` should always be the string `body`
The changes to achieve this are difficult and break several existing tests. Adding those elements means that they will also be output - as far as I'm aware it's not possible to parse but not output them... For starters, the document ends after Foo so you have to handle it here https://github.com/Masterminds/html5-php/blob/master/src/HTML5/Parser/DOMTreeBuilder.php#L570
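Until the tree builder handles this, one possible workaround on the caller's side is to normalize the DOM after parsing. The sketch below is not part of html5-php and is not the fix discussed in the linked PR. `ensureHeadAndBody` is a hypothetical helper; it assumes a document without the XHTML namespace, and its list of head-only elements is a simplification. As noted above, the added elements will also show up in the serialized output.

```php
<?php
// Hypothetical post-processing helper, not part of html5-php.
// Guarantees that <html> has exactly a <head> and a <body> child.
function ensureHeadAndBody(DOMDocument $doc): void
{
    $root = $doc->documentElement;
    if ($root === null || strtolower($root->tagName) !== 'html') {
        return; // this sketch only handles documents rooted at <html>
    }

    // Simplified assumption: these elements go to <head> until body content starts.
    $headOnly = ['title', 'meta', 'link', 'style', 'script', 'base', 'template'];

    // Snapshot the children first, because the live node list changes as we move nodes.
    $children = iterator_to_array($root->childNodes);
    $head = null;
    $body = null;
    foreach ($children as $child) {
        if ($child instanceof DOMElement && $child->tagName === 'head') {
            $head = $child;
        } elseif ($child instanceof DOMElement && $child->tagName === 'body') {
            $body = $child;
        }
    }
    $head = $head ?? $doc->createElement('head');
    $body = $body ?? $doc->createElement('body');

    // Detach everything, then rebuild the root as <head> followed by <body>.
    foreach ($children as $child) {
        $root->removeChild($child);
    }
    $root->appendChild($head);
    $root->appendChild($body);

    $bodyStarted = false;
    foreach ($children as $child) {
        if ($child->isSameNode($head) || $child->isSameNode($body)) {
            continue;
        }
        if ($child instanceof DOMText && trim($child->data) === '') {
            continue; // drop inter-element whitespace directly under <html>
        }
        $isHeadElement = $child instanceof DOMElement
            && in_array(strtolower($child->tagName), $headOnly, true);
        if ($isHeadElement && !$bodyStarted) {
            $head->appendChild($child);
        } else {
            $bodyStarted = true;
            $body->appendChild($child);
        }
    }
}

// Simulate the tree shape reported in this thread: <html><title>Foo</title></html>.
$doc = new DOMDocument();
$doc->loadXML('<html><title>Foo</title></html>');
ensureHeadAndBody($doc);
echo $doc->saveHTML();
```

This keeps the parser untouched, at the cost of a second pass over the children of the root element.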