
Unable to add `html` and `head` tags into the whitelist.

Open extempl opened this issue 3 years ago • 5 comments

I found the following comment in the source of the Whitelist class:

The cleaner and these whitelists assume that you want to clean a <code>body</code> fragment of HTML (to add user
 supplied HTML into a templated page), and not to clean a full HTML document. If the latter is the case, either wrap the
 document HTML around the cleaned body HTML, or create a whitelist that allows <code>html</code> and <code>head</code>
 elements as appropriate.

I was trying to allow the html/head/meta tags in the result with code like this:

Whitelist.relaxed()
                    .addTags("!DOCTYPE html", "html", "head", "body", "meta", "style")
                    .addAttributes("meta", "charset");

Unfortunately, what I get is output wrapped in a <body> tag, with tags like <style> moved inside the body.
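For reference, the first option from the quoted comment (wrapping document HTML around the cleaned body) could look roughly like the sketch below. DocumentWrapper and wrap are hypothetical names, not part of jsoup, and the sketch assumes the head content comes from the application rather than from the user, so it does not cover the case where user-supplied head tags must survive cleaning.

```java
// Hypothetical helper, not part of jsoup: wraps already-cleaned body HTML
// in a fixed document skeleton built by the application.
final class DocumentWrapper {

    private DocumentWrapper() {}

    /**
     * @param trustedHeadHtml head content supplied by the application (assumed trusted, not user input)
     * @param cleanedBodyHtml body HTML that has already been sanitized, e.g. by Jsoup.clean(...)
     * @return a full HTML document as a string
     */
    static String wrap(String trustedHeadHtml, String cleanedBodyHtml) {
        return "<!DOCTYPE html>\n"
                + "<html>\n"
                + "<head>" + trustedHeadHtml + "</head>\n"
                + "<body>" + cleanedBodyHtml + "</body>\n"
                + "</html>";
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the sanitizer; no jsoup call is made here.
        String cleanedBody = "34";
        String head = "<meta charset=\"utf-8\"><style>.some {color: red}</style>";
        System.out.println(wrap(head, cleanedBody));
    }
}
```

This keeps <style> and <meta> in the head only because the application puts them there itself; nothing from the dirty document's head is carried over.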

extempl avatar Apr 23 '21 12:04 extempl

It looks like this comment is 10 years old and obsolete. So is there no way to include out-of-body tags in the cleaning process? This is pretty critical in my case.

extempl avatar Apr 23 '21 12:04 extempl

I was trying to reproduce your issue. Could you please share the URL of the HTML?

RyderCRD avatar Apr 25 '21 05:04 RyderCRD

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

Whitelist whitelist = Whitelist.relaxed()
        .addTags("!DOCTYPE html", "html", "head", "body", "meta", "style")
        .addAttributes("meta", "charset");
String value = "<html><head><style>.some {color: red}</style></head><body>3<script>alert('pwned')</script>4</body></html>";
String cleaned = Jsoup.clean(value, whitelist);
// Result I get: <body><style>.some {color: red}</style>34</body>

extempl avatar Apr 26 '21 04:04 extempl

Hello, I made a modification to the static clean method to solve the problem you mentioned. Here is my solution for your reference. Replace

Document dirty = parseBodyFragment(bodyHtml, baseUri);

with

Document dirty = parse(bodyHtml, baseUri);

The parseBodyFragment call is what causes the head and body to blend together. After that, check whether head and body are each in the whitelist.

Ruefors avatar May 22 '21 09:05 Ruefors

Thanks, I'll check it later, but as far as I remember, that change alone is not enough. You can check my PR for reference.

extempl avatar May 22 '21 12:05 extempl