html5-php Invalid parsing result when head/body tag is missing

Consider this:

<html>Hello, This is a test.<br />Does it work this time?</html>

IMO, this is valid HTML, and DOMDocument parses it correctly. However, the HTML5 parser ignores the first line of text. We're using the loadHTML() method.

Even this one works with DOMDocument:

Hello, This is a test.<br />Does it work this time?
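
A minimal sketch of how the DOMDocument side of this can be checked, assuming PHP with the dom extension enabled. The helper name parsedText is hypothetical and exists only for this illustration. It makes no claim about the exact tree libxml builds, only about whether the leading text survives parsing.

```php
<?php
// Hypothetical helper: parse a string with PHP's built-in DOMDocument
// and return the text content of the resulting root element.
function parsedText(string $html): string
{
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->documentElement;
    return $root !== null ? $root->textContent : '';
}

$withHtmlTag = '<html>Hello, This is a test.<br />Does it work this time?</html>';
$withoutTags = 'Hello, This is a test.<br />Does it work this time?';

foreach ([$withHtmlTag, $withoutTags] as $input) {
    $text = parsedText($input);
    // Report whether the first line of text is still present after parsing.
    echo (strpos($text, 'Hello, This is a test.') !== false ? 'kept' : 'lost'), "\n";
}
```

The same two inputs can be passed to this library's loadHTML() to compare the results.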

According to Mozilla documentation:

html: The start tag may be omitted if the first thing inside the element is not a comment.
body: The start tag may be omitted if the first thing inside it is not a space character, comment,

Reference: https://github.com/roundcube/roundcubemail/issues/6713#issuecomment-480320339

Apr 06 '19 06:04 alecpl

Can you post the references to the Mozilla documentation about this?

Apr 06 '19 06:04 goetas

https://html5.validator.nu/ says it is not valid.

Apr 06 '19 06:04 goetas

https://developer.mozilla.org/en-US/docs/Web/HTML/Element/html
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/body

I tried https://validator.w3.org and it also returns an error, but the error looks strange to me: "Element head is missing a required instance of child element title", while there's no head at all.

Apr 06 '19 07:04 alecpl

The official HTML 5.2 documentation says:

A head element’s start tag may be omitted if the element is empty, or if the first thing inside the head element is an element.
A body element’s start tag may be omitted if the element is empty, or if the first thing inside the body element is not a space character or a comment, except if the first thing inside the body element is a meta, link, script, style, or template element.

Apr 06 '19 07:04 alecpl

Can you please post the exact references to the documentation instead of the main links? It is really hard to find the sentences you are referring to.

Apr 06 '19 07:04 goetas

Look for "Tag omission".

Apr 06 '19 07:04 alecpl

The document you have posted refers to the latest HTML 5.2 spec. This library implements most of the 5.0 spec. However, I see that starting to adopt some of the more recent specifications is a good idea, so if you wish to fix this behavior, PRs are welcome.

Apr 06 '19 07:04 goetas

The old HTML5 documentation is the same in this context:

https://www.w3.org/TR/2014/REC-html5-20141028/semantics.html#the-html-element
https://www.w3.org/TR/2014/REC-html5-20141028/sections.html#the-body-element
https://www.w3.org/TR/2014/REC-html5-20141028/dom.html#element-dfn-tag-omission

Apr 06 '19 07:04 alecpl

Also, don't miss the fact that DOMDocument parses these correctly.

Apr 06 '19 07:04 alecpl

Good to know

Apr 06 '19 07:04 goetas

Well, DOMDocument does not follow the HTML5 logic that closely; internally it is just a relaxed XML parser. DOMDocument is not very aware of the HTML5 specs.

Apr 06 '19 07:04 goetas

Yeah, the main reason we switched from DOMDocument to this lib was to get better results. In many cases the result is better, but this case obviously looks like a bug. Such "dummy" HTML code is not that uncommon in the email world.

Apr 06 '19 07:04 alecpl

Parsing such chunks of HTML would be useful when dealing with some ajax responses containing partials when scraping the web.

Feb 06 '20 15:02 librevlad

Hi, adding another test case to this issue:

Using the native DOMDocument::loadHTML() implementation:

$doc = new DOMDocument();
$doc->loadHTML('<title>Foo');
echo $doc->saveHTML();

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><title>Foo</title></head></html>

Using this library's implementation:

$parser = new HTML(['disable_html_ns' => true]);
$doc = $parser->loadHTML('<title>Foo');
echo $doc->saveHTML();

<html><title>Foo</title></html>

Feb 24 '20 16:02 ju1ius

Also, citing the spec for tag omission:

Omitting an element's start tag in the situations described below does not mean the element is not present; it is implied, but it is still there. For example, an HTML document always has a root html element, even if the string doesn't appear anywhere in the markup.

This implies that:

The result of evaluating $document->documentElement->tagName should always be the string html
The result of evaluating (new DOMXPath($document))->query('/html/head')->item(0)->tagName should always be the string head
The result of evaluating (new DOMXPath($document))->query('/html/body')->item(0)->tagName should always be the string body
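
The three expectations above can be sketched as a small check, assuming PHP's dom extension and elements without a namespace (as with the disable_html_ns option used earlier). The function name missingImpliedElements is hypothetical and not part of this library. The sample tree is built by hand, so no parser behavior is assumed.

```php
<?php
// Hypothetical helper: returns the names of the implied elements that are
// missing from a document (an empty array means all three are present).
function missingImpliedElements(DOMDocument $document): array
{
    $missing = [];

    // 1. The root element should be <html>.
    $root = $document->documentElement;
    if ($root === null || strtolower($root->tagName) !== 'html') {
        $missing[] = 'html';
    }

    // 2 and 3. /html/head and /html/body should each match an element.
    $xpath = new DOMXPath($document);
    foreach (['head', 'body'] as $name) {
        $nodes = $xpath->query('/html/' . $name);
        if ($nodes === false || $nodes->length === 0) {
            $missing[] = $name;
        }
    }

    return $missing;
}

// Build the tree from the report by hand: <html><title>Foo</title></html>
$doc = new DOMDocument();
$html = $doc->createElement('html');
$html->appendChild($doc->createElement('title', 'Foo'));
$doc->appendChild($html);

echo implode(', ', missingImpliedElements($doc)), "\n"; // prints "head, body"
```

Checking for a missing node first avoids calling ->tagName on the null that item(0) returns when the XPath query matches nothing.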

Feb 24 '20 16:02 ju1ius

@ju1ius that is a valid point, see my comment https://github.com/Masterminds/html5-php/pull/182#issuecomment-632046708 for a possible solution

May 21 '20 11:05 goetas

Using this library's implementation:

$parser = new HTML(['disable_html_ns' => true]);
$doc = $parser->loadHTML('<title>Foo');
echo $doc->saveHTML();

<html><title>Foo</title></html>

<html><title>Foo</title></html> is valid.

Also, citing the spec for tag omission:

Omitting an element's start tag in the situations described below does not mean the element is not present; it is implied, but it is still there. For example, an HTML document always has a root html element, even if the string doesn't appear anywhere in the markup.

This implies that:
1. The result of evaluating `$document->documentElement->tagName` should always be the string `html`

2. The result of evaluating `(new DOMXPath($document))->query('/html/head')->item(0)->tagName` should always be the string `head`

3. The result of evaluating `(new DOMXPath($document))->query('/html/body')->item(0)->tagName` should always be the string `body`

The changes to achieve this are difficult and break several existing tests. Adding those elements means that they will also be output; as far as I'm aware, it's not possible to parse them but not output them. For starters, the document ends after Foo, so you have to handle it here: https://github.com/Masterminds/html5-php/blob/master/src/HTML5/Parser/DOMTreeBuilder.php#L570

May 21 '20 18:05 bytestream
