java-html-sanitizer
java-html-sanitizer copied to clipboard
Sanitizing `<image src=...>` results in misnested tag output.
wynne.jg reports
Consider the following whitelist
• tag and it's attribute src is allowed
•
tags
if I then try to sanitize
<image src="http://picture.com/some_picture.gif" />
I get back
<img /></img>
which is missing the src attrib
This seems to indicate 2 things.
• Replacement happens after sanitization (in this example, the src attrib was removed before it was converted to <img>)
This can be alleviated by allowing the same attributes (src) on <image>, but is that safe to do? is there another way to specify the order of sanitization/replacement?
• Something is not right with Custom Policies and how tags are closed (you can see in the example that the <img> tag is already closed in it's opening tag, however OJHS has added an extra closing tag that is not necessary)
This is not something I have found a workaround for and may perhaps be a genuine bug? Anyone else face this?
https://www.w3.org/TR/html5/syntax.html#parsing-main-inbody explains that and reparsing
A start tag whose tag name is "
image" Parse error. Change the token's tag name to "img" and reprocess it. (Don't ask.)
The "(Don't ask.)" is probably non-normative.
My guess is that we're using "image" as a key into a table of elements that don't require an end tag when deciding whether to push an entry onto a stack of unclosed elements but using "img" as a key in the tag renderer when deciding whether to make it self-closing.
I think the best solution is to not monkey around with the tag name in the policy code. We've done a bunch of work to make sure that bugs in the parser do not affect safety or wellformedness properties of our output. By moving this kind of syntactic desugaring code into the parse phase, we can simplify the code in the TCB.
assertEquals(
"<img src=\"http://com.example/foo.png\" />",
apply(
new HtmlPolicyBuilder()
.allowElements("img")
.allowElements(
new ElementPolicy() {
public String apply(String elementName, List<String> attrs) {
return "img";
}
}, "image")
.allowAttributes("src").onElements("img", "image")
.allowStandardUrlProtocols(),
"<image src=\"http://com.example/foo.png\" />"));
repeats the problem, but without the custom ElementPolicy it does not.
@wynne.jg, Are you doing something similar in your policy?
Assuming the testcase captures the OP's problem, https://github.com/OWASP/java-html-sanitizer/commit/8ae326eba6b9bac6036e01850396c1af5b39e804 should address this.
Hey @mikesamuel, yes I'm doing something similar to the example you pasted.
Not sure what you mean by "but without the custom ElementPolicy it does not."
Surely there'd be no replacement occurring if you didn't have the custom ElementPolicy?
Also, I see that you also allow src on - I guess this is the intended usage (even though there'll be no
@pickle-weasle , responses inline.
Hey @mikesamuel, yes I'm doing something similar to the example you pasted. Not sure what you mean by "but without the custom ElementPolicy it does not." Surely there'd be no replacement occurring if you didn't have the custom ElementPolicy?
Just noting for the record.
Also, I see that you also allow
srcon as well as - I guess this is the intended usage (even though there'll be no tags in the eventual output)?
The HTML serializer tends to drop tags that don't have critical attributes for compatibility with the old AntiSamy sanitizer. For example, <a> without either name or href.
I allowed src so that enough attributes would reach the renderer for the allowed tag to reach it.
Pretty sure this is fixed in the newer releases, cheers Mike!
Can this issue be closed?