Antisamy 1.7.5 version - <body> tag issue
The Antisamy library versions above 1.7.2 require a <body> tag in the HTML page; otherwise, it causes the HTML to break. Here's an example of the input HTML:
<html lang="en">
<head>
</head>
<table class="container" cellpadding="0" cellspacing="0" align="center">
<SELECT NAME="Lang">
<OPTION VALUE="da">Dansk</OPTION>
<OPTION VALUE="en" selected=selected>English</OPTION>
</SELECT>
</table>
</html>
The output produced is:
<html lang="en">
<head>
<table class="container" cellpadding="0" cellspacing="0" align="center"></table>
**<select name="Lang"></select>**
<option value="da">Dansk</option>
<option value="en" selected="selected">English</option>
</head>
</html>
As you can see, the <select> tag closes on the same line, causing the dropdown to malfunction and breaking the HTML page. This issue does not occur in Antisamy version 1.7.2 and earlier but appears in versions after 1.7.2. We are upgrading Antisamy in our project to version 1.7.5, but this issue is causing the complete HTML page to become distorted.
Will have a look...
Had a quick look and have added a test case to neko. From this first look it seems like neko works ok. Maybe the tag is closed by some cleanup?
Had a quick look and have added a test case to neko. From this first look it seems like neko works ok. Maybe the tag is closed by some cleanup?
can antisamy version 1.7.5 adds <body > tag if its missed or not added ?
@jeetu22 i do not know so much about the inner workings of antisamy but i'm responsible for the neko-htmlunit parser (https://github.com/HtmlUnit/htmlunit-neko) used by antisamy to parse the html file and convert it into a dom tree. During this process some cleanup is done to form a valid dom (or emit valid sax events).
And yes missing body (start) elements are added for from valid dom trees. Proving this for your case was exactly the reason to write the additional test case for the parser.
@jeetu22 i do not know so much about the inner workings of antisamy but i'm responsible for the neko-htmlunit parser (https://github.com/HtmlUnit/htmlunit-neko) used by antisamy to parse the html file and convert it into a dom tree. During this process some cleanup is done to form a valid dom (or emit valid sax events).
And yes missing body (start) elements are added for from valid dom trees. Proving this for your case was exactly the reason to write the additional test case for the parser.
Thank you very much!!. as antisamy uses neko internally , anyone from Antisamy who can guide us in this scenario.i m suspecting HTMLScanner.java is modifying DOM
Maybe the org.owasp.validator.html.scan.MagicSAXFilter is the one - but only a guess.
Maybe the org.owasp.validator.html.scan.MagicSAXFilter is the one - but only a guess.
@rbri
public void selectInsideEmptyTable() throws Exception {
final String html = "<html><head></head><body>\n"
+ "<table><select name='Lang'><option value='da'>Dansk</option></select></table>\n"
+ "<script>\n"
+ LOG_TITLE_FUNCTION
+ "log(document.body.childNodes.length);\n"
+ "log(document.body.children.length);\n"
+ "log(document.body.children[0]);\n"
+ "log(document.body.children[1]);\n"
+ "log(document.body.children[2]);\n"
+ "</script>\n"
+ "</body></html>";
expandExpectedAlertsVariables(URL_FIRST);
loadPageVerifyTitle2(html);
}
can you please remove body tag from this Junit test case and assert that output HTML should contains <body> tag.
@spassarop - Can you look into this with @rbri?
i guess i found the reason - will analyze this a bit more
Ok, antisamy is using the fragment parser instead of the document parser; with the fragment parser i can reproduce the problem. Will require some time to fix that.
Thanks @rbri for being so proactive with this.
@jeetu22, even though @rbri seem to have reproduced the problem to debug, it would be useful if you provide how are you calling AntiSamy and what policy you are using. These factors make AntiSamy decide if it should use DOM or SAX parser, o which tags to preserve.
Thanks @rbri for being so proactive with this.
@jeetu22, even though @rbri seem to have reproduced the problem to debug, it would be useful if you provide how are you calling AntiSamy and what policy you are using. These factors make AntiSamy decide if it should use DOM or SAX parser, o which tags to preserve.
we are using SAX parser.
parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);
parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);
@jeetu22, setting this feature for the parser changes the behavior in some ways. One of the effects is the one you are facing - the tag balancer no longer adds missing body tags. But there are some others also.
As promised i will have a look at all that - at the moment i'm thinking about why antisamy should use the fragment way of parsing at all. Because i'm working on all this in my spare time and i have some other private things on my todo list, please be a bit patient to do not see a fix in the next hours ;-)
parser.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);
@jeetu22, setting this feature for the parser changes the behavior in some ways. One of the effects is the one you are facing - the tag balancer no longer adds missing body tags. But there are some others also.
As promised i will have a look at all that - at the moment i'm thinking about why antisamy should use the fragment way of parsing at all. Because i'm working on all this in my spare time and i have some other private things on my todo list, please be a bit patient to do not see a fix in the next hours ;-)
Thank you for the update! I appreciate you looking into the issue.Given your busy schedule, I completely understand that a fix might take some time.
Please take the time you need, and I look forward to your findings.
Thanks again for your efforts!
I don’t know too much about the SAX parser, so I have no idea about the difference nor why AntiSamy uses fragment parser. It could be changed and see how the tests react.
On Mon, 27 May 2024 at 04:30 Jitendra @.***> wrote:
parser.setFeature(" http://cyberneko.org/html/features/balance-tags/document-fragment", true);
@jeetu22 https://github.com/jeetu22, setting this feature for the parser changes the behavior in some ways. One of the effects is the one you are facing - the tag balancer no longer adds missing body tags. But there are some others also.
As promised i will have a look at all that - at the moment i'm thinking about why antisamy should use the fragment way of parsing at all. Because i'm working on all this in my spare time and i have some other private things on my todo list, please be a bit patient to do not see a fix in the next hours ;-)
Thank you for the update! I appreciate you looking into the issue.Given your busy schedule, I completely understand that a fix might take some time.
Please take the time you need, and I look forward to your findings.
Thanks again for your efforts!
— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/issues/453#issuecomment-2132824963, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMIMIZA76YXAHWLM7CDZELOHXAVCNFSM6AAAAABIG5VRYWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZSHAZDIOJWGM . You are receiving this because you were mentioned.Message ID: @.***>
After some thinking
- i have an idea why the fragment parser is used - at least form the tests it looks like antisamy also can clean html snippets, not only complete html pages
- i have improved the fragment parser in a way that the parser now takes care of an existing html tag - if this was passed before the automatic generation of head and body tags is enabled also in the fragment mode
- have added the issue as test case
@rbri wrote:
- i have an idea why the fragment parser is used - at least form the tests it looks like antisamy also can clean html snippets, not only complete html pages
That is exactly why! In fact, I think that is the most common use case for HTML sanitizers in general. There's generally some user input that you might capture that only allows some specific mark-up (and which mark up may be vary from one use to another) and you want to sanitize that to make it safe to use it in a broader context of an application generated page. I think it's rare that AntiSamy or the OWASP HTML Sanitizer project would get a complete HTML page to sanitize. That's certainly a valid use case too, but just not one that is as common. If AntiSamy ditched the fragment parser, then I think that ESAPI would have to ditch AntiSamy because dealing with HTML fragments is what Validator.getValidSafeHTML is generally expecting.
Oh right, of course. I didn’t know what fragment parser meant initially.
On Mon, 27 May 2024 at 11:59 Kevin W. Wall @.***> wrote:
@rbri https://github.com/rbri wrote:
- i have an idea why the fragment parser is used - at least form the tests it looks like antisamy also can clean html snippets, not only complete html pages
That is exactly why! In fact, I think that is the most common use case for HTML sanitizers in general. There's generally some user input that you might capture that only allows some specific mark-up (and which mark up may be vary from one use to another) and you want to sanitize that to make it safe to use it in a broader context of an application generated page. I think it's rare that AntiSamy or the OWASP HTML Sanitizer project would get a complete HTML page to sanitize. That's certainly a valid use case too, but just not one that is as common. If AntiSamy ditched the fragment parser, then I think that ESAPI would have to ditch AntiSamy because dealing with HTML fragments is what Validator.getValidSafeHTML https://javadoc.io/static/org.owasp.esapi/esapi/2.5.3.1/org/owasp/esapi/Validator.html#getValidSafeHTML-java.lang.String-java.lang.String-int-boolean-org.owasp.esapi.ValidationErrorList- is generally expecting.
— Reply to this email directly, view it on GitHub https://github.com/nahsra/antisamy/issues/453#issuecomment-2133651990, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHL3BMMSMUEIHHRT77J737DZENC4XAVCNFSM6AAAAABIG5VRYWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMZTGY2TCOJZGA . You are receiving this because you were mentioned.Message ID: @.***>
@davewichers @spassarop fix is ready in PR #454
@spassarop - can you create test case for this situation that fails, and then verify that it now passes with his snapshot version?
My PR includes such an test case...
Hahaha yeah, our man here is one step ahead ;)
neko 4.2.0 released
Ok, antisamy is using the fragment parser instead of the document parser; with the fragment parser i can reproduce the problem. Will require some time to fix that.
@rbri , can you confirm if the Neko 4.2.0 release resolves the above issue?
@jeetu22 thats the goal - but i guess there will be a new release of antisamy itself soon (see #454 for more details)
@jeetu22 thats the goal - but i guess there will be a new release of antisamy itself soon (see #454 for more details)
i tried with Antisamy:1.7.5 and neko-4.2.0 many testcases are failing in AntisamyTest.java
one such example:
@jeetu22 strange - have done this right now
- checkout the current code of the antisamy project
- change the version of neko in the pom
- run 'mvn clean test'
for me this looks like you still have an old version of neko somewhere in your class path... can you please provide the whole stack trace...
checking , will update you @rbri .
Hi @rbri,
We've thoroughly tested Antisamy 1.7.6-SNAPSHOT and found that both the workflow and UI are working fine. It would be great if you could provide a tentative release date for the non-snapshot version.
Thank you!