html5-php icon indicating copy to clipboard operation
html5-php copied to clipboard

CDATA and JavaScript gets mangled when manipulating XHTML

Open dac514 opened this issue 7 years ago • 7 comments

When loading a XHTML document with JavaScript into html5-php I expect the <script>//<![CDATA[ conversion to behave like it does in plain old \DOMDocument. Instead they get turned into htmlentities, breaking the JavaScript on our target platforms (ebook readers).

Expected:

<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
	console.log(i);
}
//]]>
</script>

Actual:

<script>
//&lt;![CDATA[
for ( var i = 0; i &lt; 10; i++ ) {
	console.log(i);
}
//]]&gt;
</script>

< becomes &lt;, > becomes &gt;, ...

Test case:

$body = '<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Test</title>
</head>
<body>
<p>Hi There!</p>
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
	console.log(i);
}
//]]>
</script>
</body>
</html>';

$d1 = new \DOMDocument();
$d1->loadXml( $body );
echo $d1->saveXML();

$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
echo $d2->saveXML();

dac514 avatar May 01 '18 15:05 dac514

I would like to submit a PR but I need clarification on: \Masterminds\HTML5\Parser\EventHandler::text.

    /**
     * A unit of parsed character data.
     *
     * Entities in this text are *already decoded*.
     */
    public function text($cdata);

What does Entities in this text are *already decoded* mean? If it means that $cdata should not be manipulated then changing \Masterminds\HTML5\Parser\DOMTreeBuilder::text from:

// fprintf(STDOUT, "Appending text %s.", $data);
$node = $this->doc->createTextNode($data);
$this->current->appendChild($node);

To:

// fprintf(STDOUT, "Appending text %s.", $data);
$frag = $this->doc->createDocumentFragment();
$frag->appendXML($data);
$this->current->appendChild($frag);

Solves this issue.

DOMDocument->createTextNode() encodes entities. Is this the desired behaviour or not?

Source: https://stackoverflow.com/a/7725965

dac514 avatar Aug 22 '18 21:08 dac514

The thing is that "script" is a raw text tag, so Tokenizer::rawText() looks directly for instead of looking for a possible cdata sequence.

When serializing back the dom, are you using this lib or the domdocument::save xml method?

goetas avatar Aug 23 '18 07:08 goetas

I am using this lib:

$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
echo $d2->saveXML();

Entities in this text are *already decoded, rawText, .... these methods and comments lead me to believe that if you pass then the result should be , not H&eacute;

My question, worded another way, is what is rawText? The code for rawText calls text and text creates the node using createTextNode which will transform entities.

If this is the case then my proposed fix is no good.

dac514 avatar Aug 23 '18 13:08 dac514

I am using this lib:

Ok, duh. No I am not. My mistake. When changing to:

$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
$mm->saveHTML( $d2 );

I get:

<!DOCTYPE html>
<html xml:lang="en"><head>
<title>Test</title>
</head>
<body>
<p>1</p>
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {8
	console.log(i);
}
//]]>
</script>
<p>2</p>
</body>
</html>

Maybe this is #nobug? Thank you for your patience.

dac514 avatar Aug 23 '18 13:08 dac514

Doing

$html = '
<!DOCTYPE html>
<html>
    <head>
        <script>ababc</script>
        <script>ab < abc</script>
        <script>
        // <![CDATA[
            foo <
        // ]]>
        bar
        </script>
    </head>
</html>';

$html5 = new HTML5();
$doc = $html5->loadHTML($html);
var_dump($html5->saveHTML($doc));

I get:

<!DOCTYPE html>
<html>
    <head>
        <script>ababc</script>
        <script>ab < abc</script>
        <script>
        // <![CDATA[
            foo <
        // ]]>
        bar
        </script>
    </head>
</html>

goetas avatar Aug 23 '18 14:08 goetas

Back from vacation so excuse the all over the place tangent.

Elsewhere in our code, we have often used $mm->saveHTML( $d2 ) where $mm equals \Masterminds\HTML5

For this specific bug we need to save back as valid XML. We expect similar results to \Masterminds\HTML5\Tests\Parser\DOMTreeBuilderTest::testMoveNonInlineElements

I just tried the proposed fix and, for this specific bug, and it doesn' work for us because <br/> get turned into <br> which breaks our XSLT.

The issue remains. We need

<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
	console.log(i);
}
//]]>
</script>

Preserved in XML saves like \DOMDocument does it.

dac514 avatar Aug 23 '18 14:08 dac514

$html5->saveHTML is there exactly because the DOM version did not work as expected in many other cases... Re opening as "feature request"

goetas avatar Aug 23 '18 14:08 goetas