HTML Comment <!-- <style> --> produces empty result
I have the following simplified PHP code. The HTML comes from an untrusted source and needs to be purified. This is a minimal example that reproduces the problem; my real HTMLPurifier config is a lot more complex.
$config = HTMLPurifier_Config::createDefault();
$config->set('Core.Encoding', 'UTF-8');
$config->set('HTML.Doctype', 'HTML 4.01 Transitional');
$config->set('Filter.ExtractStyleBlocks', true);
$purifier = new HTMLPurifier($config);
$dirtyHtml = <<<EOF
<!-- <style> -->
<style>
div {font-size: 12px;}
</style>
<div>
some text
</div>
EOF;
var_dump($purifier->purify($dirtyHtml));
The output is an empty string.
If I remove the "Filter.ExtractStyleBlocks" line, I get the correct output:
<div>
some text
</div>
If I remove the HTML Comment from the $dirtyHtml, it also works fine. The problem seems to be this HTML comment in combination with the Filter.ExtractStyleBlocks.
As a quick workaround I remove all occurrences of <!-- <style> --> from the $dirtyHtml before purifying it, but it's just a workaround for now, until HTMLPurifier is fixed.
This will be annoying to fix. The root cause is that we're regexing out all occurrences of <style>...</style>, so a <style> that appears inside an HTML comment gets matched as well.
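To see why the output ends up empty, the style-block regex can be applied by hand to the input from the report. This is a minimal sketch outside of HTMLPurifier; it assumes the pattern quoted further down in this thread is equivalent to the one ExtractStyleBlocks runs on the raw HTML, and the variable names are only illustrative.

```php
<?php
// Input from the report.
$html = "<!-- <style> -->\n<style>\ndiv {font-size: 12px;}\n</style>\n<div>\nsome text\n</div>\n";

// Assumed to be equivalent to the pattern ExtractStyleBlocks uses
// (i = case-insensitive, s = dot matches newlines, U = ungreedy).
$pattern = '#<style(?:\s.*)?>(.*)<\/style>#isU';

// The match starts at the <style> inside the comment and runs to the
// first </style>, so it swallows the comment's closing "-->".
preg_match($pattern, $html, $m);
$remaining = preg_replace($pattern, '', $html);

var_dump($m[0]);      // the span that gets cut out
var_dump($remaining); // what is left for the parser
```

What remains starts with a dangling `<!-- ` and no longer contains a `-->`, so everything after it sits inside an unterminated comment. That would explain why the purified result is an empty string.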
Maybe a filter that removes all HTML comments (except HTML conditional comments) would be nice. I guess HTML comments containing complete HTML tags could also cause failures in other filters or situations during parsing?
Yeah, that seems reasonable, but I wouldn't want it to be turned on automatically by ExtractStyleBlocks; that would be weird. Also, there are some interactions we'd have to be careful about. For example, people sometimes comment out CSS using HTML comments, so if you regex all comments out you'll remove the CSS too.
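As a rough illustration of the comment-stripping idea and the caveat about commented-out CSS, here is a minimal sketch. The helper name `stripPlainHtmlComments` is hypothetical and not part of HTMLPurifier, and the pattern only recognises the usual IE conditional comment forms; it is not a full HTML comment parser.

```php
<?php
/**
 * Hypothetical helper (not part of HTMLPurifier): removes HTML comments
 * but leaves IE conditional comments such as
 * <!--[if IE]>...<![endif]--> untouched.
 */
function stripPlainHtmlComments(string $html): string
{
    // <!--                comment opener
    // (?!\[if\b|...)      skip openers that belong to a conditional comment:
    //                     "<!--[if", "<!--<![endif]" and the "<!-->" marker
    // .*?-->              shortest run up to the closer (s: dot matches newlines)
    $pattern = '#<!--(?!\[if\b|<!\[endif\]|>).*?-->#s';

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return preg_replace($pattern, '', $html) ?? $html;
}

// A plain comment is removed.
var_dump(stripPlainHtmlComments("<!-- <style> -->\n<div>some text</div>"));

// A conditional comment is kept as-is.
var_dump(stripPlainHtmlComments("<!--[if IE]><p>IE only</p><![endif]-->"));

// The caveat: CSS that was wrapped in an HTML comment is lost.
var_dump(stripPlainHtmlComments("<style><!-- div {font-size: 12px;} --></style>"));
```

The last call returns an empty `<style></style>` element, which is the interaction to be careful about if such a filter ran before style extraction.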
Maybe the regex of ExtractStyleBlocks can be improved, so that it doesn't match <style> tags inside HTML comments... But that's not easy I guess 😄
Another possibility would be to just temporarily remove the HTML comments inside ExtractStyleBlocks while running the regex, and restore them afterwards.
In https://github.com/ezyang/htmlpurifier/blob/master/library/HTMLPurifier/Filter/ExtractStyleBlocks.php#L101 instead of the preg_replace_callback line you could do something like this:
// Remove all HTML comments so that a <style> or </style> inside a comment cannot match the regex below.
// The "s" modifier lets "." match newlines, so multi-line comments are removed as well.
$htmlWithoutHtmlComments = preg_replace('#<!--.*?-->#s', '', $html);
// Now detect the <style> blocks.
preg_match_all('#<style(?:\s.*)?>(.*)<\/style>#isU', $htmlWithoutHtmlComments, $matches);
foreach ($matches[0] as $i => $match) {
    // Store the style block. styleCallback() is a preg_replace_callback callback,
    // so it is passed a match array here (assumption: it reads the CSS from index 1).
    $this->styleCallback(array($match, $matches[1][$i]));
    // Remove the style block from the original $html (which still contains the HTML comments).
    $html = str_replace($match, '', $html);
}
That could do the trick, and fix the bug?
The regex to detect HTML comments is not perfect yet: it also removes HTML conditional comments (including their inner HTML code). Not sure if that's a good idea here...
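The "temporarily remove, then restore" idea could also be sketched with placeholders: each comment is swapped for a unique token, the style regex runs on the masked text, and the tokens are swapped back afterwards. Conditional comments are restored untouched this way, although a style block inside one would not be extracted. The function name and token format below are hypothetical; this is a standalone sketch, not HTMLPurifier's implementation.

```php
<?php
/**
 * Hypothetical standalone sketch, not HTMLPurifier code.
 * Returns [HTML without <style> blocks, list of extracted CSS strings].
 */
function extractStyleBlocksIgnoringComments(string $html): array
{
    $comments = [];
    // Random prefix so a token is unlikely to collide with real content.
    $prefix = 'x-comment-' . bin2hex(random_bytes(8)) . '-';

    // 1. Swap every HTML comment for a unique placeholder token.
    $masked = preg_replace_callback(
        '#<!--.*?-->#s',
        function (array $m) use (&$comments, $prefix): string {
            $token = $prefix . count($comments) . '-';
            $comments[$token] = $m[0];
            return $token;
        },
        $html
    );

    // 2. Cut out the <style> blocks; comments can no longer interfere.
    //    Tokens inside the CSS are expanded again, so CSS that was
    //    wrapped in an HTML comment is kept as it was written.
    $styles = [];
    $masked = preg_replace_callback(
        '#<style(?:\s.*)?>(.*)<\/style>#isU',
        function (array $m) use (&$styles, &$comments): string {
            $styles[] = strtr($m[1], $comments);
            return '';
        },
        $masked
    );

    // 3. Put the original comments back into the remaining HTML.
    return [strtr($masked, $comments), $styles];
}

// Input from the report at the top of this thread.
$dirtyHtml = "<!-- <style> -->\n<style>\ndiv {font-size: 12px;}\n</style>\n<div>\nsome text\n</div>\n";
[$html, $styles] = extractStyleBlocksIgnoringComments($dirtyHtml);
var_dump($html, $styles);
```

On the reported input, the comment survives intact, the real style block is extracted, and the `<div>` is left in the HTML. Error handling for a `null` return from `preg_replace_callback` is omitted to keep the sketch short.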