Add ability to clean XML Document / preserve case in cleaned HTML

Open Juli4nSc opened this issue 2 years ago • 1 comments

Hello,

I'm trying to sanitize some SVG content and am using Jsoup for that specific case. It is possible to get an XML Document by using the xmlParser as below:

Document document = Jsoup.parse(svg, Parser.xmlParser());

However, there is then no way to clean this XML with a whitelist (Safelist): the Cleaner handles the content as if it were HTML. Is there a way to do this? I would expect it to work when XML parsing is enabled.

What I need here is preserving case sensitivity on the Attributes and Tags which is only possible when using XML parsing

Juli4nSc, Mar 29 '23 12:03

Right, the Cleaner right now is designed to take HTML body content and clean that. I had been thinking of adding extra support to clean a complete Document (vs a body fragment). That path would also then support XML Documents.
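Until such support exists, one possible workaround is to walk the parsed XML tree by hand. The following is a minimal sketch, not a Cleaner feature: the class `XmlAllowlistSketch` and its two allowlists are hypothetical names chosen for illustration, and the allowlist contents are placeholders rather than a vetted SVG policy. It only filters tag and attribute names (case-sensitively, as XML requires) and does not inspect attribute values such as URLs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of jsoup: a hand-rolled allowlist filter for XML/SVG.
class XmlAllowlistSketch {
    // Illustrative allowlists only; a real SVG policy needs far more care.
    static final Set<String> ALLOWED_TAGS =
            Set.of("svg", "g", "path", "linearGradient", "stop");
    static final Set<String> ALLOWED_ATTRS =
            Set.of("viewBox", "d", "id", "offset", "gradientUnits");

    static String clean(String xml) {
        // The XML parser keeps tag and attribute case as written.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false);

        // Copy the element list first, because the tree is mutated during the walk.
        List<Element> elements = new ArrayList<>(doc.getAllElements());
        for (Element el : elements) {
            if (el instanceof Document) continue; // skip the synthetic root node
            if (!ALLOWED_TAGS.contains(el.tagName())) {
                el.remove(); // drops the element together with its subtree
                continue;
            }
            // asList() returns a snapshot, so removing attributes here is safe.
            for (Attribute attr : el.attributes().asList()) {
                if (!ALLOWED_ATTRS.contains(attr.getKey())) {
                    el.attributes().remove(attr.getKey()); // case-sensitive removal
                }
            }
        }
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        String svg = "<svg viewBox=\"0 0 1 1\" onload=\"x()\"><script>alert(1)</script>"
                   + "<linearGradient id=\"g\"/></svg>";
        System.out.println(clean(svg));
    }
}
```

In this sketch a disallowed element is removed along with everything inside it, which is stricter than the Cleaner's usual behaviour of keeping child text; whether that is the right trade-off depends on the content being sanitized.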

Another (and for your case, probably better) feature would be to enable case-insensitive attribute checks and output case-preserving HTML. You can almost do that now -- the Cleaner checks tags by their normal (lower-cased) names, but does not do the same for attributes. So currently, through the Cleaner, tag case can be preserved, but not attribute case.
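To make the idea concrete, here is a small sketch of a case-insensitive check with case-preserving output. It uses only the Java standard library and is not how the Cleaner is implemented; the class `CaseInsensitiveAttrCheck`, its `filter` method, and the allowlist entries are hypothetical and exist purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; this is not jsoup's Cleaner implementation.
class CaseInsensitiveAttrCheck {
    // Allowlist stored in lower case and used only for comparison.
    static final Set<String> ALLOWED = Set.of("viewbox", "preserveaspectratio");

    // Keeps attributes whose lower-cased name is allowed; keys keep their original case.
    static Map<String, String> filter(Map<String, String> attrs) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            String normalName = e.getKey().toLowerCase(Locale.ROOT);
            if (ALLOWED.contains(normalName)) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("viewBox", "0 0 10 10");
        attrs.put("onLoad", "x()");
        System.out.println(filter(attrs)); // prints {viewBox=0 0 10 10}
    }
}
```

The comparison happens on the normalized name, while the original spelling is what gets written back out, so `viewBox` survives with its capital B intact.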

> What I need here is preserving case sensitivity on the Attributes and Tags which is only possible when using XML parsing

For just parsing (not the cleaner, as noted above), you can preserve tag and attribute case and still use the HTML parser. E.g.:

Document doc = Jsoup.parse(
    "<SVG viewBox=123 />",
    Parser.htmlParser()
        .settings(ParseSettings.preserveCase)
);
System.out.println(doc.body().html()); // print only the body content

Gives

<SVG viewBox="123" />

Another nice-to-have might be to automatically preserve case in SVG elements when they appear in HTML.

jhy, Mar 29 '23 22:03