
stripNonValidXMLCharacters doesn't work with HTML where html.length() > 1

Open izian opened this issue 5 years ago • 10 comments

It looks like somewhere around version 1.5, the method org.owasp.validator.html.scan.AntiSamyDOMScanner#stripNonValidXMLCharacters was altered to check the Pattern for invalidXmlCharacters with java.util.regex.Matcher#matches(). I presume that was a bid for efficiency, to skip a replaceAll() call when there was no need to change the input String. You use the same technique in a test to check whether there are timing improvements.

I believe it should use java.util.regex.Matcher#find() instead.

matches() checks whether the entire sequence matches the pattern. Since the pattern represents only a single character (one that falls in one of the defined sets), a sequence (the HTML) longer than 1 character can never match. It's been this way forever, I believe, at least since Java 5.

find() will find the next subsequence that matches the pattern, in effect checking quickly and succeeding fast if the HTML needs to be cleansed.

You're getting a speed increase because matches() reaches the second char and declares the sequence a non-match, regardless of what the input contains.
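For illustration, here is a minimal sketch of the matches()/find() difference using the same character class as AntiSamy's pattern (the demo class and its main method are mine, added purely for demonstration):

import java.util.regex.Pattern;

public class MatchesVsFindDemo {
    // The same character class AntiSamy's invalidXmlCharacters pattern uses.
    private static final Pattern INVALID_XML = Pattern.compile(
            "[\\u0000-\\u001F\\uD800-\\uDFFF\\uFFFE-\\uFFFF&&[^\\u0009\\u000A\\u000D]]");

    public static void main(String[] args) {
        String html = "<div>Hello\u0001</div>"; // one control character inside
        // matches() asks whether the ENTIRE input is a single invalid character,
        // so any input longer than one char reports false.
        System.out.println(INVALID_XML.matcher(html).matches()); // false
        // find() asks whether an invalid character occurs ANYWHERE in the input.
        System.out.println(INVALID_XML.matcher(html).find());    // true
    }
}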

Input HTML: <div>Hello\uD83D\uDC95</div>

Expected on org.owasp.validator.html.CleanResults#getCleanHTML : <div>Hello</div>

Actual: <div>Hello\uD83D\uDC95</div>

Where the input HTML is the single character \uD888 only, the output is the empty string, as expected.

I looked through the test class here and can see no tests where you expect data to be cleansed. All the tests ensure that characters make it through OK, or that something is faster (checking only 1 char is faster!).

Incidentally, I only noticed this because the AntiSamy code looked like it wanted to cleanse the characters needed for an emoji, where the character is actually valid per the XML and HTML specs as far as I can tell. When their UTF-8 bytes are read by our system, we get a Java String backed by 16-bit chars, and the character points fall within your filter. I don't believe you should be stripping those when they come together: according to java.lang.Character#isSurrogatePair, the two chars \uD83D \uDC95 together return true rather than false, and the toCodePoint method tells us the pair is &#128149;. So I think the checks in this method ought to be more complex. Ironically, if the code in this method worked as intended, the characters would have been cleansed away. But they weren't.
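The pair can be checked with nothing more than the JDK's own Character methods; a small standalone snippet (the class wrapper is mine):

public class SurrogatePairCheck {
    public static void main(String[] args) {
        char high = '\uD83D', low = '\uDC95';
        // The two chars together form one valid code point, not two invalid ones.
        System.out.println(Character.isSurrogatePair(high, low)); // true
        System.out.println(Character.toCodePoint(high, low));     // 128149, i.e. U+1F495
    }
}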

I believe you could get manipulative code points through now, because of this. But I can't be certain as I'm looking purely from a data cleansing point of view.

izian avatar Feb 13 '19 11:02 izian

@izian I can replicate the behavior that <div>Hello\uD83D\uDC95</div> doesn't change when getting sanitized by AntiSamy. However, when I try just \uD888, that doesn't get sanitized either. Can you show me the exact usage of AntiSamy that results in a single character getting sanitized properly, as you describe?

@nahsra - Are these special unicode characters supposed to get stripped by AntiSamy's default policy?

davewichers avatar Mar 29 '19 21:03 davewichers

The reason we were searching for these unusual (at least, they were unusual 10 years ago) Unicode ranges was that the underlying XML parsing API blew up when encountering them. We therefore had to add some defensive code to filter them out before processing.

I agree that this code could be cleaned up a lot, that the XML dependencies may have become sturdier since then, and that there is nominal risk from this behavior. This issue should stay open and be acted upon, but it isn't a blocker for release.

I appreciate the thoughtful write-up.

nahsra avatar Mar 30 '19 01:03 nahsra

To be clear, this issue is not a vulnerability (hence @nahsra's 'nominal risk' comment). It is intended functionality that doesn't work, but the fact that it doesn't isn't a security issue.

davewichers avatar Mar 30 '19 02:03 davewichers

Thanks for checking it out. I can't remember why we thought this was part of the security of the utility. For some reason, when I first took a look, our code was documented as using this to make sure we didn't let malicious control characters through. Perhaps those are a thing of the past, or never really existed. In either case, given that this functionality hasn't worked in a while (unless you pass through a single character), it would likely be worse for people who have become accustomed to no character filtering if it were "fixed" rather than simply removed, should it no longer be needed.

I just wanted to point out the matches()/find() distinction. It's a common trip-up, like forEach vs forEachOrdered, findFirst vs findAny, and List.removeAll not being the same as removing via an iterator. I see matches()/find() issues in a lot of code.

Thanks for taking the time to look into it.

izian avatar Mar 30 '19 08:03 izian

@nahsra - Given the comment: "given that this functionality hasn’t worked in a while (unless you pass through a single character) it would likely be worse for people who have become accustomed to no character filtering if it were “fixed” rather than simply removed if no longer needed." do you think we should still try to fix this, or just leave it alone?

davewichers avatar Mar 30 '19 18:03 davewichers

@spassarop - Hey Sebastian - any clue how to address this old issue? There are two test cases for this issue already in the test class, but they are commented out: testGithubIssue34a() and testGithubIssue34b(). Same for issue #24 and its test case: testGithubIssue24(). If you could address either or both, that would be fantastic.

davewichers avatar Jan 13 '21 00:01 davewichers

@davewichers, @izian - I looked into the function that checks the regex of "invalid characters". I've tested with find() instead of matches(), and for the DOM parser it gets better results. However, when it comes to replacing, the surrogate pair \uD83D\uDC95, which forms 💕, does not get entirely replaced; one of the unicode characters remains.

I've tried different inputs and changes to the pattern, but the result does not improve. I know it probably isn't a performant solution, but implementing the replacement by hand does in fact replace all the characters. As a reminder, this is the pattern:

"[\u0000-\u001F\uD800-\uDFFF\uFFFE-\uFFFF&&[^\u0009\u000A\u000D]]"

I've built a string with all the supposedly invalid characters with the following code:

// Build a string containing every supposedly invalid character. An int loop
// index avoids the char wraparound at 0xFFFF (and the off-by-one in the
// original do/while, which never appended 0xFFFF itself).
StringBuilder s = new StringBuilder();
for (int i = 0x0; i <= 0xFFFF; i++) {
    char c = (char) i;
    if ((c <= 0x1F || (c >= 0xD800 && c <= 0xDFFF) || c >= 0xFFFE)
            && c != 0x9 && c != 0xA && c != 0xD) {
        s.append(c);
    }
}

Then I ran this assertion, which leaves a single unicode character in the output (I can't remember which one, but it's in the middle of one of the pattern's ranges):

assertEquals("<div>Hello</div>", as.scan("<div>Hello"+s.toString()+"</div>", policy, AntiSamy.DOM).getCleanHTML());

Then I copied and pasted the current implementation of StripNonValidXmlCharacters from the .NET project and made the changes needed to compile in Java:

if (in == null || "".equals(in)) {
    return ""; // vacancy test.
}

StringBuilder cleanText = new StringBuilder(); // Used to hold the output.
char current; // Used to reference the current character.

for (int i = 0; i < in.length(); i++) {
    current = in.charAt(i);
    // Keep only characters allowed by the XML 1.0 Char production. Note that
    // a Java char never exceeds 0xFFFF, so the 0x10000-0x10FFFF clause
    // (carried over from the .NET original) can never be true here.
    if ((current == 0x9) || (current == 0xA) || (current == 0xD)
            || ((current >= 0x20) && (current <= 0xD7FF))
            || ((current >= 0xE000) && (current <= 0xFFFD))
            || ((current >= 0x10000) && (current <= 0x10FFFF))) {
        cleanText.append(current);
    }
}

return cleanText.toString();

And well, that works because it's "manual" replacement that doesn't depend on Java's built-in regex replacement. A preceding find() call can skip this replacement entirely when it's too expensive to always run it.
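For reference, here is one way the two pieces could fit together: a minimal sketch combining the find() fast path with the manual loop (the method and class names are mine, not the project's code):

import java.util.regex.Pattern;

public final class XmlCharStripper {
    // The same character class as AntiSamy's invalidXmlCharacters pattern.
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
            "[\\u0000-\\u001F\\uD800-\\uDFFF\\uFFFE-\\uFFFF&&[^\\u0009\\u000A\\u000D]]");

    static String stripNonValidXmlCharacters(String in) {
        if (in == null || in.isEmpty()) {
            return ""; // vacancy test.
        }
        // Fast path: find() succeeds as soon as any invalid char occurs. Caveat:
        // Java's regex engine matches by code point, so a well-formed surrogate
        // pair may not trip this guard even though the loop below would strip it.
        if (!INVALID_XML_CHARS.matcher(in).find()) {
            return in;
        }
        StringBuilder cleanText = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char current = in.charAt(i);
            if (current == 0x9 || current == 0xA || current == 0xD
                    || (current >= 0x20 && current <= 0xD7FF)
                    || (current >= 0xE000 && current <= 0xFFFD)) {
                cleanText.append(current);
            }
        }
        return cleanText.toString();
    }
}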

About the SAX parser:

  • It does not have the character check; it's implemented only in the DOM parser.
  • If it gets implemented for SAX, it shows the same behavior as DOM. But when the input is a Reader object, AntiSamy can't read the plain HTML without some transformations, which I'd hope wouldn't mess with any character in it, if this were implemented (see the sketch below).
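On that Reader concern: draining a Reader into a String doesn't itself transform anything, since any byte-to-char decoding already happened when the Reader was constructed. A small sketch of such a helper (the name readFully is mine, purely illustrative):

import java.io.IOException;
import java.io.Reader;

final class ReaderUtil {
    // Copies a Reader's chars into a String unchanged; encoding decisions
    // were already made when the Reader (e.g. an InputStreamReader with an
    // explicit charset) was created.
    static String readFully(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}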

That's all I can say after some hours of analyzing this.

spassarop avatar Jan 17 '21 22:01 spassarop

Thanks Sebastian for researching! Do you think we should change/fix AntiSamy in some way or are we already doing the best we can with the parsers we are using?

-Dave


davewichers avatar Jan 17 '21 23:01 davewichers

I can take another look tomorrow to refresh my two-year-old memory of this one. I remember thinking about: Are most strings clean? It would be good to avoid overhead on the 99% use case. If you were about to strip a character, is it worth checking whether it and its following character represent a surrogate pair? Do you even want to allow 16-bit but otherwise valid characters through? (A sketch of what that pair check might look like follows below.)
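For the middle question, a hypothetical pair-aware variant might look like this; it is entirely my own sketch, not anything in AntiSamy, and a fuller version would also validate the code point the pair produces:

final class PairAwareStripper {
    // Keeps well-formed surrogate pairs (e.g. emoji), strips lone surrogates
    // and the other invalid XML characters.
    static String stripPreservingPairs(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (Character.isHighSurrogate(c) && i + 1 < in.length()
                    && Character.isLowSurrogate(in.charAt(i + 1))) {
                out.append(c).append(in.charAt(++i)); // valid pair, keep both halves
            } else if (c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD)) {
                out.append(c);
            }
        }
        return out.toString();
    }
}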

I recognise that the use of this method by our developer is probably far removed from its intended use. The same developer also escaped JSON as if it were HTML.

I only happened upon this library by chance, chasing down my missing code point, and thought this was where it was lost. It was actually another library pretending to do good for XML, and it removed only one of the code points, leaving behind part of a surrogate pair, which destroyed the XML when it was serialised again.

Maybe I could afford some time to take another look here and see if I can follow what you've written, and come to an understanding of a fix in line with your intention for the method; not our use :-)

izian avatar Jan 17 '21 23:01 izian

I've extended the test @izian mentioned before, the one that tests performance, just to check locally how bad the solution I proposed above is. Executions vary: sometimes the "manual" replacement is faster than replaceAll(), sometimes the other way round. However, the differences are at most 5 ms. So if the implementation of the replacement changes, all characters get replaced and performance should stay the same.

The only thing that worries me is the SAX parser with the Reader object: because of the default encoding used when reading it, can we do the cleansing and rebuild the object without losing something? Maybe I'm worrying when I shouldn't, but I don't know.

However, I've also built the test string with all the unicode characters that should be stripped, in order to run another test: don't strip at all. I commented out the line where cleansing occurs, and the library still gave an output, so it didn't blow up. Maybe after the NekoHTML updates that is no longer a problem and we can remove the invalid-XML-characters validation. I'm just providing that information; I prefer not to make that decision.

spassarop avatar Jan 17 '21 23:01 spassarop