jsoup
Unable to add `html` and `head` tags into the whitelist.
I found the following comment in the code of the Whitelist class:
The cleaner and these whitelists assume that you want to clean a <code>body</code> fragment of HTML (to add user
supplied HTML into a templated page), and not to clean a full HTML document. If the latter is the case, either wrap the
document HTML around the cleaned body HTML, or create a whitelist that allows <code>html</code> and <code>head</code>
elements as appropriate.
So I tried to allow the html/head/meta tags in the result with code like this:
Whitelist.relaxed()
.addTags("!DOCTYPE html", "html", "head", "body", "meta", "style")
.addAttributes("meta", "charset");
Unfortunately, what I get from this is the markup wrapped in a <body> tag, with tags like <style> moved inside the body.
It looks like this comment is 10 years old and obsolete. So is there no way to include out-of-body tags in the cleaning process? This is pretty critical in my case.
I was trying to reproduce your issue. Could you please share the URL of the HTML?
Whitelist whitelist = Whitelist.relaxed()
.addTags("!DOCTYPE html", "html", "head", "body", "meta", "style")
.addAttributes("meta", "charset");
String value = "<html><head><style>.some {color: red}</style></head><body>3<script>alert('pwned')</script>4</body></html>";
Jsoup.clean(value, whitelist);
// <body><style>.some {color: red}</style>34</body>
Hello, I just made some modifications to the static method clean to solve the problem you mentioned. Here is my solution for your reference. Replace
Document dirty = parseBodyFragment(bodyHtml, baseUri);
with
Document dirty = parse(bodyHtml, baseUri);
The call to parseBodyFragment is what causes the head and body to blend together. After that change, check separately whether head and body are each in the whitelist.
Thanks, I'll check it later, but as far as I remember, that change alone is not enough. You can check my PR for reference.
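For reference, the Whitelist comment quoted at the top of this thread also mentions a first option: wrapping the document HTML around the cleaned body HTML. Below is a minimal sketch of that idea in plain Java, with no jsoup calls. The class name PageTemplate and the method wrap are hypothetical, not part of jsoup. It assumes the body string has already been returned by Jsoup.clean, and that the head markup is written by the application rather than taken from user input. It does not cover the case in this issue, where the head itself comes from the untrusted document.

```java
// Hypothetical helper, not part of jsoup: wraps a cleaned body fragment
// in a document shell whose head is supplied by the application.
final class PageTemplate {
    private PageTemplate() {}

    /**
     * @param trustedHeadHtml  head markup authored by the application; it is
     *                         inserted verbatim, so it must never contain user input
     * @param cleanedBodyHtml  body fragment, assumed to be the output of Jsoup.clean
     * @return a full HTML document as a string
     */
    static String wrap(String trustedHeadHtml, String cleanedBodyHtml) {
        return "<!DOCTYPE html>\n"
             + "<html>\n"
             + "<head>\n" + trustedHeadHtml + "\n</head>\n"
             + "<body>\n" + cleanedBodyHtml + "\n</body>\n"
             + "</html>";
    }
}
```

With this in place, something like PageTemplate.wrap("<meta charset=\"utf-8\">", Jsoup.clean(value, whitelist)) would keep the head out of the cleaner entirely, so html, head, and meta would not need to be added to the whitelist at all.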