html5-php
html5-php copied to clipboard
Parser remove the single < (less than) character from given html string
When parsing a html string with single use of <, it removes it from the parsed value that being returned . For example
<?php
use Masterminds\HTML5;
$html = '<img src="invalid-url" onerror="alert(\'XSS Attack prefix\')" /> 2 > 1 & 3 < 5 and some more text';
// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);
// Render it as HTML5:
print $html5->saveHTML($dom);
the print of $html5->saveHTML($dom) should return as
<!DOCTYPE html>
<html><img src="invalid-url" onerror="alert('XSS Attack prefix')"> 2 > 1 & 3 < 5 and some more text</html>
but instead it return as
<!DOCTYPE html>
<html><img src="invalid-url" onerror="alert('XSS Attack prefix')"> 2 > 1 & 3 5 and some more text</html>
see the missing encoded < of < character .
This is a continuation of https://github.com/symfony/symfony/issues/57597 where it is impacting the sanitization process of html-sanitizer
In an earlier investigation I noted that a small tweak to the library seems to fix the issue, though I haven't fully tested the change. From comments elsewhere:
...there does not appear to be any in-built way to avoid dropping the
<character because the parser immediately flushes the buffer and moves to the next character upon encountering it (ref).I did find that a slight tweak to the tokenizer captures the
<and correctly encodes it in the output. It only requires that two lines be added following the line raising the syntax error in the HTML5 Tokenizer:$this->scanner->unconsume(); $this->text($this->scanner->current());
hi, if the solution mentioned there works, can you provide a PR?
I have not fully tested the proposed solution (and, honestly, only have a high-level understanding of the code) so can't say if there are any potential ill effects. But I'm happy to put together a PR.
Hi, do you find a solution ? Tks
I created a PR with a change I think will address this issue with regard to the data state (content parsing).
There is a similar problem with tag/attribute parsing, though I think that could reasonably be addressed through a separate issue.