CDATA scanning in XML not behaving properly
We are using the AntiSamy plugin to parse XML with content inside a CDATA section. This used to work before this commit was added to htmlunit-neko: https://github.com/HtmlUnit/htmlunit-neko/commit/49a31c089482088aca57facb20e1a15792cfc3bd
For example, given XML like this:
<xt:c-code xt:name="code" xt:version="1" xt:id="15ae0cc7-ded7-4a74-97b8-d66238d3c177"><xt:parameter xt:name="language">html</xt:parameter><xt:text-body><![CDATA[<div></div>]]></xt:text-body></xt:c-code>
Before this commit, the result of scanning the CDATA part was <![CDATA[<div></div>]]>, but after this commit the result is <![CDATA[<div]]>]]>
We are parsing this XML, specifically the content inside CDATA and then storing it. Later when viewing we extract the content inside CDATA and render it on the web page.
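For comparison, here is a minimal sketch of what a strict XML parser returns for the same sample. It uses only the JDK's built-in JAXP DOM parser, not AntiSamy or neko, so it shows the baseline XML behaviour rather than reproducing the bug. The class and method names (CdataBaseline, textOf) are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataBaseline {
    // The sample from this issue, unchanged.
    static final String SAMPLE =
        "<xt:c-code xt:name=\"code\" xt:version=\"1\" xt:id=\"15ae0cc7-ded7-4a74-97b8-d66238d3c177\">"
        + "<xt:parameter xt:name=\"language\">html</xt:parameter>"
        + "<xt:text-body><![CDATA[<div></div>]]></xt:text-body></xt:c-code>";

    // Returns the text content of the first element with the given qualified name.
    static String textOf(String xml, String qName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Namespace awareness stays off because the sample uses the xt: prefix
            // without declaring it; a namespace-aware parser would reject it.
            factory.setNamespaceAware(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // getTextContent() includes the character data of CDATA sections.
            return doc.getElementsByTagName(qName).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // In XML a CDATA section ends only at "]]>", so the inner markup survives.
        System.out.println(textOf(SAMPLE, "xt:text-body")); // prints <div></div>
    }
}
```

This is the output we expect for XML input: the whole <div></div> kept as CDATA content.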
I also raised an issue for this on the htmlunit-neko repo:
https://github.com/HtmlUnit/htmlunit-neko/issues/125
Is this the expected behaviour going forward? Is there a way we can bring back the previous behaviour for folks who may be using it for XML content parsing?
I have added a new feature in neko, 'http://cyberneko.org/html/features/scanner/cdata-early-closing', in version 4.6.0. You have to set this if you are parsing XHTML code, because there we do not have to do this strange early closing.
Hopefully there is a way to do this from antisamy.
@spassarop - Can you research this? Neko-htmlunit v4.6.0 is included in the AntiSamy:1.7.7 we just released.
I was not able to reproduce that output exactly. I tried adding the custom tags to the default policy and scanning the whole XML, and I also tried scanning just the CDATA. I had to guess the policy and the input string, as they were not explicitly stated with a code example.
What I do get for the CDATA section in every scan is this kind of output: <div]]>, which is similar.
If I add the feature @rbri mentioned, the output is <div]]> when it is set to true and <div></div> when it is set to false. The second one seems to be the expected output in the issue description.
What we can do, if that matches the desired behavior, is add a new directive that allows setting that feature in the SAX and DOM parsers by policy. What I am not sure about is which default value to use. Probably the best would be false by default, as that was the behavior before it changed with the Neko upgrade.
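A rough sketch of that idea, assuming a hypothetical directive name (cdataEarlyClosing) and a stand-in interface for anything that exposes setFeature(String, boolean). Neither the directive nor the FeatureSink interface exists in AntiSamy today; only the feature id is taken from this thread.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class CdataDirectiveSketch {
    // Feature id quoted earlier in this thread (htmlunit-neko 4.6.0).
    static final String CDATA_EARLY_CLOSING =
        "http://cyberneko.org/html/features/scanner/cdata-early-closing";

    // Hypothetical directive name, invented for this sketch.
    static final String DIRECTIVE = "cdataEarlyClosing";

    // Stand-in for a SAX or DOM parser that accepts boolean features.
    interface FeatureSink {
        void setFeature(String featureId, boolean value);
    }

    // Reads the directive from the policy and forwards it to the parser.
    // A missing directive means false, i.e. the behaviour before the upgrade.
    static void apply(Map<String, String> directives, FeatureSink parser) {
        boolean enabled = Boolean.parseBoolean(directives.getOrDefault(DIRECTIVE, "false"));
        parser.setFeature(CDATA_EARLY_CLOSING, enabled);
    }

    public static void main(String[] args) {
        // Record the calls in a map instead of using a real parser.
        Map<String, Boolean> recorded = new LinkedHashMap<>();
        apply(Collections.<String, String>emptyMap(), recorded::put);
        System.out.println(recorded.get(CDATA_EARLY_CLOSING)); // prints false

        apply(Collections.singletonMap(DIRECTIVE, "true"), recorded::put);
        System.out.println(recorded.get(CDATA_EARLY_CLOSING)); // prints true
    }
}
```

The feature has to be applied before parsing starts, so in practice this would run where the SAX and DOM parsers are created.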
@akshay-kr, if this description and analysis seem accurate for your needs, let us know.
Sorry for making things a bit more complicated, but I found another issue in neko regarding validation of attribute names. The root cause is more or less the same as for this one: when parsing HTML, some things are really different (and more complicated) compared to parsing XHTML. Currently I think about making the parser a bit more clever and automatically choosing the correct way of working instead of having this kind of switches and control it from the outside.
@rbri - You just released 4.10.0. Do any of the new releases since 4.6.0 help with this issue, so we can try to address it cleanly using your library rather than having to do a bunch of extra work outside your library? CC to: @spassarop
@davewichers sorry this was a bit forgotten, still thinking....
@rbri - Nudge. Any thoughts/feedback on this?
I will come back to you in the next few days... sorry for the delay
@davewichers @spassarop @akshay-kr
As promised, I had another look at this. Earlier I wrote:
"Currently I think about making the parser a bit more clever and automatically choosing the correct way of working instead of having this kind of switches and control it from the outside."
After some more checks and a deeper look, I think this is not really possible.
To summarize the whole story: parsing XHTML and parsing HTML are different, and at least some of the quirks of HTML definitely work differently for XHTML. Therefore neko has some settings...
The sad part of the story (or of these settings) is that they have to be configured before the parsing starts. As an example, you can have a look at the htmlunit code here: https://github.com/HtmlUnit/htmlunit/blob/56f3644cf495d001b3dbb8cb5f629896a2c63793/src/main/java/org/htmlunit/html/parser/neko/HtmlUnitNekoHtmlParser.java#L140
I fear that for AntiSamy this means we have to find a way to figure out whether the content to be parsed is XHTML or HTML. I'm not really familiar with the code, so maybe you already have a suggestion for how to do that.
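One possible starting point, purely as a sketch: sniff the beginning of the input for XHTML markers before the parser is configured. The class name (MarkupSniffer), the 1024-character window and the choice of signals are all assumptions made for this illustration, not anything AntiSamy or neko does today.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class MarkupSniffer {
    // Matches an XHTML namespace declaration in already lower-cased input.
    private static final Pattern XHTML_NAMESPACE =
        Pattern.compile("xmlns\\s*=\\s*[\"']http://www\\.w3\\.org/1999/xhtml[\"']");

    // Heuristic only: true when the start of the content carries an XML
    // declaration, an XHTML doctype, or the XHTML namespace.
    static boolean looksLikeXhtml(String content) {
        String head = content.length() > 1024 ? content.substring(0, 1024) : content;
        // Drop a leading byte order mark, if present.
        if (!head.isEmpty() && head.charAt(0) == '\uFEFF') {
            head = head.substring(1);
        }
        String lower = head.trim().toLowerCase(Locale.ROOT);
        return lower.startsWith("<?xml")
            || lower.contains("-//w3c//dtd xhtml")
            || XHTML_NAMESPACE.matcher(lower).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeXhtml("<?xml version=\"1.0\"?><html/>"));
        System.out.println(looksLikeXhtml("<!DOCTYPE html><html><body></body></html>"));
    }
}
```

A heuristic like this will miss bare fragments. The sample from this issue has no XML declaration, doctype or XHTML namespace, so it would be classified as HTML, which suggests an explicit setting would still be needed as an override.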
Sorry for the late answer....