antisamy icon indicating copy to clipboard operation
antisamy copied to clipboard

Antisamy 1.7.5 version - <body> tag issue

Open jeetu22 opened this issue 1 year ago • 29 comments

The Antisamy library versions above 1.7.2 require a <body> tag in the HTML page; otherwise, it causes the HTML to break. Here's an example of the input HTML:

<html lang="en">
<head>
</head> 
<table class="container" cellpadding="0" cellspacing="0" align="center">    
<SELECT NAME="Lang">
<OPTION VALUE="da">Dansk</OPTION>
<OPTION VALUE="en" selected=selected>English</OPTION>
</SELECT>
</table>
</html>

The output produced is:

<html lang="en">
<head>
    <table class="container" cellpadding="0" cellspacing="0" align="center"></table>
    **<select name="Lang"></select>**
    <option value="da">Dansk</option>
    <option value="en" selected="selected">English</option>
</head>
</html>

As you can see, the <select> tag closes on the same line, causing the dropdown to malfunction and breaking the HTML page. This issue does not occur in Antisamy version 1.7.2 and earlier but appears in versions after 1.7.2. We are upgrading Antisamy in our project to version 1.7.5, but this issue is causing the complete HTML page to become distorted.

jeetu22 avatar May 24 '24 06:05 jeetu22

Will have a look...

rbri avatar May 24 '24 07:05 rbri

Had a quick look and have added a test case to neko. From this first look it seems like neko works ok. Maybe the tag is closed by some cleanup?

rbri avatar May 24 '24 07:05 rbri

Had a quick look and have added a test case to neko. From this first look it seems like neko works ok. Maybe the tag is closed by some cleanup?

can antisamy version 1.7.5 adds <body > tag if its missed or not added ?

jeetu22 avatar May 24 '24 07:05 jeetu22

@jeetu22 i do not know so much about the inner workings of antisamy but i'm responsible for the neko-htmlunit parser (https://github.com/HtmlUnit/htmlunit-neko) used by antisamy to parse the html file and convert it into a dom tree. During this process some cleanup is done to form a valid dom (or emit valid sax events).

And yes missing body (start) elements are added for from valid dom trees. Proving this for your case was exactly the reason to write the additional test case for the parser.

rbri avatar May 24 '24 08:05 rbri

@jeetu22 i do not know so much about the inner workings of antisamy but i'm responsible for the neko-htmlunit parser (https://github.com/HtmlUnit/htmlunit-neko) used by antisamy to parse the html file and convert it into a dom tree. During this process some cleanup is done to form a valid dom (or emit valid sax events).

And yes missing body (start) elements are added for from valid dom trees. Proving this for your case was exactly the reason to write the additional test case for the parser.

Thank you very much!!. as antisamy uses neko internally , anyone from Antisamy who can guide us in this scenario.i m suspecting HTMLScanner.java is modifying DOM

jeetu22 avatar May 24 '24 08:05 jeetu22

Maybe the org.owasp.validator.html.scan.MagicSAXFilter is the one - but only a guess.

rbri avatar May 24 '24 08:05 rbri

Maybe the org.owasp.validator.html.scan.MagicSAXFilter is the one - but only a guess.

@rbri

public void selectInsideEmptyTable() throws Exception {
       final String html = "<html><head></head><body>\n"
               + "<table><select name='Lang'><option value='da'>Dansk</option></select></table>\n"
               + "<script>\n"
               + LOG_TITLE_FUNCTION
               + "log(document.body.childNodes.length);\n"
               + "log(document.body.children.length);\n"
               + "log(document.body.children[0]);\n"
               + "log(document.body.children[1]);\n"
               + "log(document.body.children[2]);\n"
               + "</script>\n"
               + "</body></html>";

       expandExpectedAlertsVariables(URL_FIRST);

       loadPageVerifyTitle2(html);
   }

can you please remove body tag from this Junit test case and assert that output HTML should contains <body> tag.

jeetu22 avatar May 24 '24 12:05 jeetu22

@spassarop - Can you look into this with @rbri?

davewichers avatar May 24 '24 14:05 davewichers

i guess i found the reason - will analyze this a bit more

rbri avatar May 26 '24 09:05 rbri

Ok, antisamy is using the fragment parser instead of the document parser; with the fragment parser i can reproduce the problem. Will require some time to fix that.

rbri avatar May 26 '24 10:05 rbri

Thanks @rbri for being so proactive with this.

@jeetu22, even though @rbri seem to have reproduced the problem to debug, it would be useful if you provide how are you calling AntiSamy and what policy you are using. These factors make AntiSamy decide if it should use DOM or SAX parser, o which tags to preserve.

spassarop avatar May 26 '24 14:05 spassarop

Thanks @rbri for being so proactive with this.

@jeetu22, even though @rbri seem to have reproduced the problem to debug, it would be useful if you provide how are you calling AntiSamy and what policy you are using. These factors make AntiSamy decide if it should use DOM or SAX parser, o which tags to preserve.

we are using SAX parser. parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);

jeetu22 avatar May 27 '24 07:05 jeetu22

parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);

@jeetu22, setting this feature for the parser changes the behavior in some ways. One of the effects is the one you are facing - the tag balancer no longer adds missing body tags. But there are some others also.

As promised i will have a look at all that - at the moment i'm thinking about why antisamy should use the fragment way of parsing at all. Because i'm working on all this in my spare time and i have some other private things on my todo list, please be a bit patient to do not see a fix in the next hours ;-)

rbri avatar May 27 '24 07:05 rbri

parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);

@jeetu22, setting this feature for the parser changes the behavior in some ways. One of the effects is the one you are facing - the tag balancer no longer adds missing body tags. But there are some others also.

As promised i will have a look at all that - at the moment i'm thinking about why antisamy should use the fragment way of parsing at all. Because i'm working on all this in my spare time and i have some other private things on my todo list, please be a bit patient to do not see a fix in the next hours ;-)

Thank you for the update! I appreciate you looking into the issue.Given your busy schedule, I completely understand that a fix might take some time.

Please take the time you need, and I look forward to your findings.

Thanks again for your efforts!

jeetu22 avatar May 27 '24 07:05 jeetu22

I don’t know too much about the SAX parser, so I have no idea about the difference nor why AntiSamy uses fragment parser. It could be changed and see how the tests react.

On Mon, 27 May 2024 at 04:30 Jitendra @.***> wrote:

parser.setFeature(" http://cyberneko.org/html/features/balance-tags/document-fragment", true);

@jeetu22 https://github.com/jeetu22, setting this feature for the parser changes the behavior in some ways. One of the effects is the one you are facing - the tag balancer no longer adds missing body tags. But there are some others also.

As promised i will have a look at all that - at the moment i'm thinking about why antisamy should use the fragment way of parsing at all. Because i'm working on all this in my spare time and i have some other private things on my todo list, please be a bit patient to do not see a fix in the next hours ;-)

Thank you for the update! I appreciate you looking into the issue.Given your busy schedule, I completely understand that a fix might take some time.

Please take the time you need, and I look forward to your findings.

Thanks again for your efforts!

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/issues/453#issuecomment-2132824963, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMIMIZA76YXAHWLM7CDZELOHXAVCNFSM6AAAAABIG5VRYWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZSHAZDIOJWGM . You are receiving this because you were mentioned.Message ID: @.***>

spassarop avatar May 27 '24 11:05 spassarop

After some thinking

  • i have an idea why the fragment parser is used - at least form the tests it looks like antisamy also can clean html snippets, not only complete html pages
  • i have improved the fragment parser in a way that the parser now takes care of an existing html tag - if this was passed before the automatic generation of head and body tags is enabled also in the fragment mode
  • have added the issue as test case

rbri avatar May 27 '24 12:05 rbri

@rbri wrote:

  • i have an idea why the fragment parser is used - at least form the tests it looks like antisamy also can clean html snippets, not only complete html pages

That is exactly why! In fact, I think that is the most common use case for HTML sanitizers in general. There's generally some user input that you might capture that only allows some specific mark-up (and which mark up may be vary from one use to another) and you want to sanitize that to make it safe to use it in a broader context of an application generated page. I think it's rare that AntiSamy or the OWASP HTML Sanitizer project would get a complete HTML page to sanitize. That's certainly a valid use case too, but just not one that is as common. If AntiSamy ditched the fragment parser, then I think that ESAPI would have to ditch AntiSamy because dealing with HTML fragments is what Validator.getValidSafeHTML is generally expecting.

kwwall avatar May 27 '24 14:05 kwwall

Oh right, of course. I didn’t know what fragment parser meant initially.

On Mon, 27 May 2024 at 11:59 Kevin W. Wall @.***> wrote:

@rbri https://github.com/rbri wrote:

  • i have an idea why the fragment parser is used - at least form the tests it looks like antisamy also can clean html snippets, not only complete html pages

That is exactly why! In fact, I think that is the most common use case for HTML sanitizers in general. There's generally some user input that you might capture that only allows some specific mark-up (and which mark up may be vary from one use to another) and you want to sanitize that to make it safe to use it in a broader context of an application generated page. I think it's rare that AntiSamy or the OWASP HTML Sanitizer project would get a complete HTML page to sanitize. That's certainly a valid use case too, but just not one that is as common. If AntiSamy ditched the fragment parser, then I think that ESAPI would have to ditch AntiSamy because dealing with HTML fragments is what Validator.getValidSafeHTML https://javadoc.io/static/org.owasp.esapi/esapi/2.5.3.1/org/owasp/esapi/Validator.html#getValidSafeHTML-java.lang.String-java.lang.String-int-boolean-org.owasp.esapi.ValidationErrorList- is generally expecting.

— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/issues/453#issuecomment-2133651990, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMMSMUEIHHRT77J737DZENC4XAVCNFSM6AAAAABIG5VRYWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZTGY2TCOJZGA . You are receiving this because you were mentioned.Message ID: @.***>

spassarop avatar May 27 '24 15:05 spassarop

@davewichers @spassarop fix is ready in PR #454

rbri avatar Jun 01 '24 09:06 rbri

@spassarop - can you create test case for this situation that fails, and then verify that it now passes with his snapshot version?

davewichers avatar Jun 01 '24 18:06 davewichers

My PR includes such an test case...

rbri avatar Jun 01 '24 22:06 rbri

Hahaha yeah, our man here is one step ahead ;)

spassarop avatar Jun 01 '24 22:06 spassarop

neko 4.2.0 released

rbri avatar Jun 04 '24 17:06 rbri

Ok, antisamy is using the fragment parser instead of the document parser; with the fragment parser i can reproduce the problem. Will require some time to fix that.

@rbri , can you confirm if the Neko 4.2.0 release resolves the above issue?

jeetu22 avatar Jun 05 '24 06:06 jeetu22

@jeetu22 thats the goal - but i guess there will be a new release of antisamy itself soon (see #454 for more details)

rbri avatar Jun 05 '24 07:06 rbri

@jeetu22 thats the goal - but i guess there will be a new release of antisamy itself soon (see #454 for more details)

i tried with Antisamy:1.7.5 and neko-4.2.0 many testcases are failing in AntisamyTest.java one such example: image

jeetu22 avatar Jun 05 '24 07:06 jeetu22

@jeetu22 strange - have done this right now

  • checkout the current code of the antisamy project
  • change the version of neko in the pom
  • run 'mvn clean test'

image

rbri avatar Jun 05 '24 08:06 rbri

for me this looks like you still have an old version of neko somewhere in your class path... can you please provide the whole stack trace...

rbri avatar Jun 05 '24 08:06 rbri

checking , will update you @rbri .

jeetu22 avatar Jun 13 '24 06:06 jeetu22

Hi @rbri,

We've thoroughly tested Antisamy 1.7.6-SNAPSHOT and found that both the workflow and UI are working fine. It would be great if you could provide a tentative release date for the non-snapshot version.

Thank you!

jeetu22 avatar Jul 03 '24 12:07 jeetu22