html5-php
html5-php copied to clipboard
CDATA and JavaScript gets mangled when manipulating XHTML
When loading a XHTML document with JavaScript into html5-php I expect the <script>//<![CDATA[ conversion to behave like it does in plain old \DOMDocument. Instead they get turned into htmlentities, breaking the JavaScript on our target platforms (ebook readers).
Expected:
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
console.log(i);
}
//]]>
</script>
Actual:
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
console.log(i);
}
//]]>
</script>
<becomes<,>becomes>, ...
Test case:
$body = '<?xml version="1.0" encoding="UTF-8" ?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Test</title>
</head>
<body>
<p>Hi There!</p>
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
console.log(i);
}
//]]>
</script>
</body>
</html>';
$d1 = new \DOMDocument();
$d1->loadXml( $body );
echo $d1->saveXML();
$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
echo $d2->saveXML();
I would like to submit a PR but I need clarification on: \Masterminds\HTML5\Parser\EventHandler::text.
/**
* A unit of parsed character data.
*
* Entities in this text are *already decoded*.
*/
public function text($cdata);
What does Entities in this text are *already decoded* mean? If it means that $cdata should not be manipulated then changing \Masterminds\HTML5\Parser\DOMTreeBuilder::text from:
// fprintf(STDOUT, "Appending text %s.", $data);
$node = $this->doc->createTextNode($data);
$this->current->appendChild($node);
To:
// fprintf(STDOUT, "Appending text %s.", $data);
$frag = $this->doc->createDocumentFragment();
$frag->appendXML($data);
$this->current->appendChild($frag);
Solves this issue.
DOMDocument->createTextNode() encodes entities. Is this the desired behaviour or not?
Source: https://stackoverflow.com/a/7725965
The thing is that "script" is a raw text tag, so Tokenizer::rawText() looks directly for instead of looking for a possible cdata sequence.
When serializing back the dom, are you using this lib or the domdocument::save xml method?
I am using this lib:
$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
echo $d2->saveXML();
Entities in this text are *already decoded, rawText, .... these methods and comments lead me to believe that if you pass Hé then the result should be Hé, not Hé
My question, worded another way, is what is rawText? The code for rawText calls text and text creates the node using createTextNode which will transform entities.
If this is the case then my proposed fix is no good.
I am using this lib:
Ok, duh. No I am not. My mistake. When changing to:
$mm = new \Masterminds\HTML5();
$d2 = $mm->loadHTML( $body );
$mm->saveHTML( $d2 );
I get:
<!DOCTYPE html>
<html xml:lang="en"><head>
<title>Test</title>
</head>
<body>
<p>1</p>
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {8
console.log(i);
}
//]]>
</script>
<p>2</p>
</body>
</html>
Maybe this is #nobug? Thank you for your patience.
Doing
$html = '
<!DOCTYPE html>
<html>
<head>
<script>ababc</script>
<script>ab < abc</script>
<script>
// <![CDATA[
foo <
// ]]>
bar
</script>
</head>
</html>';
$html5 = new HTML5();
$doc = $html5->loadHTML($html);
var_dump($html5->saveHTML($doc));
I get:
<!DOCTYPE html>
<html>
<head>
<script>ababc</script>
<script>ab < abc</script>
<script>
// <![CDATA[
foo <
// ]]>
bar
</script>
</head>
</html>
Back from vacation so excuse the all over the place tangent.
Elsewhere in our code, we have often used $mm->saveHTML( $d2 ) where $mm equals \Masterminds\HTML5
For this specific bug we need to save back as valid XML. We expect similar results to \Masterminds\HTML5\Tests\Parser\DOMTreeBuilderTest::testMoveNonInlineElements
I just tried the proposed fix and, for this specific bug, and it doesn' work for us because <br/> get turned into <br> which breaks our XSLT.
The issue remains. We need
<script>
//<![CDATA[
for ( var i = 0; i < 10; i++ ) {
console.log(i);
}
//]]>
</script>
Preserved in XML saves like \DOMDocument does it.
$html5->saveHTML is there exactly because the DOM version did not work as expected in many other cases...
Re opening as "feature request"