html5-php Parser remove the single < (less than) character from given html string

Parser remove the single < (less than) character from given html string

Open touhidurabir opened this issue 1 year ago • 5 comments

When parsing a html string with single use of <, it removes it from the parsed value that being returned . For example

<?php

use Masterminds\HTML5;

$html = '<img src="invalid-url" onerror="alert(\'XSS Attack prefix\')" /> 2 > 1 & 3 < 5 and some more text';

// Parse the document. $dom is a DOMDocument.
$html5 = new HTML5();
$dom = $html5->loadHTML($html);

// Render it as HTML5:
print $html5->saveHTML($dom);

the print of $html5->saveHTML($dom) should return as

<!DOCTYPE html>
<html><img src="invalid-url" onerror="alert('XSS Attack prefix')"> 2 &gt; 1 &amp; 3 &lt; 5 and some more text</html>

but instead it return as

<!DOCTYPE html>
<html><img src="invalid-url" onerror="alert('XSS Attack prefix')"> 2 &gt; 1 &amp; 3  5 and some more text</html>

see the missing encoded < of < character .

This is a continuation of https://github.com/symfony/symfony/issues/57597 where it is impacting the sanitization process of html-sanitizer

Jul 05 '24 12:07 touhidurabir

In an earlier investigation I noted that a small tweak to the library seems to fix the issue, though I haven't fully tested the change. From comments elsewhere:

...there does not appear to be any in-built way to avoid dropping the < character because the parser immediately flushes the buffer and moves to the next character upon encountering it (ref).

I did find that a slight tweak to the tokenizer captures the < and correctly encodes it in the output. It only requires that two lines be added following the line raising the syntax error in the HTML5 Tokenizer:
$this->scanner->unconsume();
$this->text($this->scanner->current());

Jul 05 '24 19:07 bsweeney

hi, if the solution mentioned there works, can you provide a PR?

Jul 17 '24 13:07 goetas

I have not fully tested the proposed solution (and, honestly, only have a high-level understanding of the code) so can't say if there are any potential ill effects. But I'm happy to put together a PR.

Jul 21 '24 15:07 bsweeney

Hi, do you find a solution ? Tks

Aug 05 '24 11:08 benoit-waldmann

I created a PR with a change I think will address this issue with regard to the data state (content parsing).

There is a similar problem with tag/attribute parsing, though I think that could reasonably be addressed through a separate issue.

Aug 07 '24 04:08 bsweeney

html5-php html5-php copied to clipboard

Parser remove the single < (less than) character from given html string

html5-php
html5-php copied to clipboard