java-html-sanitizer icon indicating copy to clipboard operation
java-html-sanitizer copied to clipboard

Sanitizing `<image src=...>` results in misnested tag output.

Open mikesamuel opened this issue 8 years ago • 7 comments

wynne.jg reports

Consider the following whitelist • tag and it's attribute src is allowed • tags are getting converted to tags

if I then try to sanitize

<image src="http://picture.com/some_picture.gif" />

I get back

<img /></img> 

which is missing the src attrib

This seems to indicate 2 things. • Replacement happens after sanitization (in this example, the src attrib was removed before it was converted to <img>) This can be alleviated by allowing the same attributes (src) on <image>, but is that safe to do? is there another way to specify the order of sanitization/replacement? • Something is not right with Custom Policies and how tags are closed (you can see in the example that the <img> tag is already closed in it's opening tag, however OJHS has added an extra closing tag that is not necessary) This is not something I have found a workaround for and may perhaps be a genuine bug? Anyone else face this?

mikesamuel avatar Dec 12 '16 20:12 mikesamuel

https://www.w3.org/TR/html5/syntax.html#parsing-main-inbody explains that tags are handled by translating to and reparsing

A start tag whose tag name is "image" Parse error. Change the token's tag name to "img" and reprocess it. (Don't ask.)

The "(Don't ask.)" is probably non-normative.

My guess is that we're using "image" as a key into a table of elements that don't require an end tag when deciding whether to push an entry onto a stack of unclosed elements but using "img" as a key in the tag renderer when deciding whether to make it self-closing.

I think the best solution is to not monkey around with the tag name in the policy code. We've done a bunch of work to make sure that bugs in the parser do not affect safety or wellformedness properties of our output. By moving this kind of syntactic desugaring code into the parse phase, we can simplify the code in the TCB.

mikesamuel avatar Dec 12 '16 20:12 mikesamuel

    assertEquals(
        "<img src=\"http://com.example/foo.png\" />",
        apply(
            new HtmlPolicyBuilder()
            .allowElements("img")
            .allowElements(
                new ElementPolicy() {

                  public String apply(String elementName, List<String> attrs) {
                    return "img";
                  }

                }, "image")
            .allowAttributes("src").onElements("img", "image")
            .allowStandardUrlProtocols(),
            "<image src=\"http://com.example/foo.png\" />"));

repeats the problem, but without the custom ElementPolicy it does not.

@wynne.jg, Are you doing something similar in your policy?

mikesamuel avatar Dec 12 '16 20:12 mikesamuel

Assuming the testcase captures the OP's problem, https://github.com/OWASP/java-html-sanitizer/commit/8ae326eba6b9bac6036e01850396c1af5b39e804 should address this.

mikesamuel avatar Dec 12 '16 22:12 mikesamuel

Hey @mikesamuel, yes I'm doing something similar to the example you pasted. Not sure what you mean by "but without the custom ElementPolicy it does not." Surely there'd be no replacement occurring if you didn't have the custom ElementPolicy? Also, I see that you also allow src on as well as - I guess this is the intended usage (even though there'll be no tags in the eventual output)?

pickle-weasle avatar Dec 14 '16 16:12 pickle-weasle

@pickle-weasle , responses inline.

Hey @mikesamuel, yes I'm doing something similar to the example you pasted. Not sure what you mean by "but without the custom ElementPolicy it does not." Surely there'd be no replacement occurring if you didn't have the custom ElementPolicy?

Just noting for the record.

Also, I see that you also allow src on as well as - I guess this is the intended usage (even though there'll be no tags in the eventual output)?

The HTML serializer tends to drop tags that don't have critical attributes for compatibility with the old AntiSamy sanitizer. For example, <a> without either name or href. I allowed src so that enough attributes would reach the renderer for the allowed tag to reach it.

mikesamuel avatar Jan 12 '17 17:01 mikesamuel

Pretty sure this is fixed in the newer releases, cheers Mike!

pickle-weasle avatar Apr 18 '17 12:04 pickle-weasle

Can this issue be closed?

csware avatar Jan 31 '24 09:01 csware