java-html-sanitizer
<img> srcset attribute encoding spaces before width descriptor to %20
According to the HTML reference, the srcset attribute of the <img> element should accept:
A list of one or more strings separated by commas indicating a set of possible image sources for the user agent to use. Each string is composed of:
a URL to an image,
**optionally, whitespace followed by one of:**
- a width descriptor, or a positive integer directly followed by 'w'. The width descriptor is divided by the source size given in the sizes attribute to calculate the effective pixel density.
- a pixel density descriptor, which is a positive floating point number directly followed by 'x'.
An example of this could be
srcset="https://developer.cdn.mozilla.net/static/img/beast-404.ce38fcf80386.png 1000w".
I recently upgraded to a newer Java HTML Sanitizer release and noticed what looks like a regression.
The sanitizer treats the space and the width descriptor as part of the URL and percent-encodes them, like so:
srcset="https://developer.cdn.mozilla.net/static/img/beast-404.ce38fcf80386.png%201000w"
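To make the expected behavior concrete, here is a minimal sketch of srcset-aware encoding. It is not the library's implementation: the class SrcsetSketch and the methods encodeUrl and encodeSrcset are hypothetical names, encodeUrl only stands in for whatever URL normalization the sanitizer applies, and splitting on commas is a simplification (URLs may legitimately contain commas).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SrcsetSketch {
    // Width descriptor ("1000w") or pixel density descriptor ("1.5x").
    private static final Pattern DESCRIPTOR =
        Pattern.compile("\\d+w|\\d+(\\.\\d+)?x");

    // Hypothetical stand-in for the sanitizer's URL normalization:
    // here it only percent-encodes spaces.
    public static String encodeUrl(String url) {
        return url.replace(" ", "%20");
    }

    // Encodes each image candidate's URL, leaving a trailing descriptor untouched.
    public static String encodeSrcset(String srcset) {
        List<String> candidates = new ArrayList<>();
        // Simplification: assumes commas only separate candidates.
        for (String raw : srcset.split(",")) {
            String candidate = raw.trim();
            if (candidate.isEmpty()) {
                continue;
            }
            int lastSpace = candidate.lastIndexOf(' ');
            String tail = lastSpace < 0 ? "" : candidate.substring(lastSpace + 1);
            if (DESCRIPTOR.matcher(tail).matches()) {
                // Split off the descriptor and encode only the URL part.
                String url = candidate.substring(0, lastSpace).trim();
                candidates.add(encodeUrl(url) + " " + tail);
            } else {
                // No descriptor: the whole candidate is a URL.
                candidates.add(encodeUrl(candidate));
            }
        }
        return String.join(", ", candidates);
    }

    public static void main(String[] args) {
        System.out.println(encodeSrcset(
            "https://developer.cdn.mozilla.net/static/img/beast-404.ce38fcf80386.png 1000w"));
    }
}
```

With this approach the example value above passes through unchanged, because the space before 1000w is treated as a separator rather than as part of the URL.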
I'm guessing this is related somehow: https://github.com/OWASP/java-html-sanitizer/issues/20
Yep. Will add handling for groups of URLs.
I am facing the same issue after upgrading OWASP-html-sanitizer.jar to the latest release. I have verified that the behavior was introduced in the 20160614.1 release (it worked in the 20160526.1 release).
Here is my program:
import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

public class URLSanitizationTest {
    static void sanitizeURL(String url) {
        // Set up the rules for the sanitizer: allow <img src> with http, https, and file URLs.
        PolicyFactory pf = new HtmlPolicyBuilder()
            .allowUrlProtocols("http", "https", "file")
            .allowElements("img")
            .allowAttributes("src").onElements("img")
            .toFactory();
        // The sanitizer works better when given a complete tag.
        final String prefix = "<img src=\"";
        final String suffix = "\" />";
        String input = prefix + url + suffix;
        // Sanitize.
        String output = pf.sanitize(input);
        System.out.println("Sanitized URL:" + output);
    }

    public static void main(String[] args) {
        // URL containing spaces and an attempted attribute injection.
        String url = "http://www.mks.com/image s/en/logob.gif onload=\"alert('hi')\"@url\"";
        sanitizeURL(url);
    }
}
Output before 20160614.1 release:
Sanitized URL:<img src="http://www.mks.com/image s/en/logob.gif onload=" />
Output since 20160614.1 release:
Sanitized URL:<img src="http://www.mks.com/image%20s/en/logob.gif%20onload=" />
I am not sure whether this is expected behavior (and if so, why?) or a bug.
Image src before the 20160614.1 release: "http://www.mks.com/image s/en/logob.gif onload="
Image src since the 20160614.1 release: "http://www.mks.com/image%20s/en/logob.gif%20onload="
Note: in the first output, read the = after onload as the character reference &#61; (it is present in the actual output).
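Independently of which output is intended, a caller who does not want to depend on how the sanitizer treats raw spaces could reject such URLs before building the tag. The following is only a sketch under that assumption: UrlPrecheckSketch and isWellFormedUrl are hypothetical names, and the check relies on java.net.URI rejecting unencoded spaces and quotes.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlPrecheckSketch {
    // Returns true only if the string parses as a URI; raw spaces and
    // double quotes are illegal characters and cause a URISyntaxException.
    public static boolean isWellFormedUrl(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String url = "http://www.mks.com/image s/en/logob.gif onload=\"alert('hi')\"@url\"";
        System.out.println("Well-formed: " + isWellFormedUrl(url));
    }
}
```

This does not replace sanitization; it only filters out inputs like the test URL above before they reach the sanitizer.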
Can this issue be closed now?