jsoup
Element.hasClass ignores html strict mode
Hi, jsoup ignores case when matching class selectors. This happens regardless of whether the document uses strict HTML mode (<!DOCTYPE html>) or not, so in strict mode the result differs from what a browser does.
For example: <!DOCTYPE html> <html><head><style type="text/css"> .c1{ font-size:44px; } .C1{ color:red; } </style></head><body> <div class="c1"> Some text </div></body></html>
The following will fetch the div, although the c is in lowercase in the div: document.select(".C1");
My findings: the class evaluator's matches method calls Element.hasClass, and Element.hasClass checks for a match while ignoring case.
Excuse me. I have reproduced the behaviour by:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Repro {
    public static void main(String[] args) {
        // The stylesheet defines both .c1 and .C1; the div only uses lowercase "c1".
        String html = "<!DOCTYPE html>\n" +
                "<html>\n" +
                "<head>\n" +
                " <style type=\"text/css\">\n" +
                " .c1 {\n" +
                " font-size: 44px;\n" +
                " }\n" +
                "\n" +
                " .C1 {\n" +
                " color: #ffa578;\n" +
                " }\n" +
                " </style>\n" +
                "</head>\n" +
                "<body>\n" +
                "<div class=\"c1\">\n" +
                " Some text\n" +
                "</div>\n" +
                "</body>\n" +
                "</html>";
        Document doc = Jsoup.parse(html);
        // Reproduction: both selectors match the div despite the difference in case.
        System.out.println(doc.select("[class=C1]").get(0).text());
        System.out.println(doc.select("[class=c1]").get(0).text());
    }
}
Could you tell me how to use html strict mode so I can test and add some features for Jsoup.select()?
Hi, in HTML5 we set strict mode by adding <!DOCTYPE html> at the beginning of the HTML. If we remove it, we don't use strict mode.
Hi, I think jsoup currently does not support a case-sensitive select(), and this does not depend on whether the document is in HTML strict mode. From here you can see that selectors in jsoup are case-insensitive.
As a simple workaround, you could do a text replacement before calling select, replacing the uppercase or lowercase variant with different content to eliminate the conflict, or you could apply a separate case-sensitive check to the results after selection.
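The second workaround (a case-sensitive check after selection) could look roughly like the sketch below. It is plain Java and works on class attribute strings only, so that it runs without jsoup; ExactClassFilter, hasClassExact and filterExact are hypothetical names, not part of jsoup. With jsoup you would presumably pass each selected element's className() value to the check.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of jsoup: a case-sensitive post-filter
// applied to the class attribute values of already selected elements.
final class ExactClassFilter {

    // True only if classAttr contains className as a whole token, with exact case.
    static boolean hasClassExact(String classAttr, String className) {
        if (classAttr == null || className.isEmpty()) {
            return false;
        }
        return Arrays.asList(classAttr.trim().split("\\s+")).contains(className);
    }

    // Keeps only the attribute values that really contain className.
    static List<String> filterExact(List<String> classAttrs, String className) {
        return classAttrs.stream()
                .filter(attr -> hasClassExact(attr, className))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-ins for the class attributes of elements a case-insensitive select returned.
        List<String> selected = Arrays.asList("c1", "C1 big", "note c1");
        System.out.println(filterExact(selected, "C1")); // prints [C1 big]
    }
}
```

This leaves select() itself untouched, at the cost of one extra pass over the results.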
There is no doubt that your findings are correct. In the source code of jsoup 1.13.1 (the latest version so far), if we change line 1374 of Element.java from "return className.equalsIgnoreCase(classAttr);" to "return className.equals(classAttr);" then the problem in your example is solved. The div with the lowercase class "c1" will no longer be selected by document.select(".C1");.
If we want to solve this problem completely, we need to add a case-sensitive option to the selectors in jsoup. Since Java does not support default parameters, and to avoid disturbing existing functions, overloading the hasXxx methods seems like a good solution.
For example:
public boolean hasClass(String className) {
    // Existing callers keep the case-insensitive behaviour.
    return this.hasClass(className, false);
}

public boolean hasClass(String className, boolean caseSensitive) {
    // some code here
    if (len == wantLen) {
        if (caseSensitive)
            return className.equals(classAttr);
        return className.equalsIgnoreCase(classAttr);
    }
    // some code here
}
But then a bunch of methods need to be modified like this, since the methods are nested and we need to pass the boolean value all the way down the call chain. This will make jsoup more complex, and I'm not sure whether it will have side effects.
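To make the overload idea concrete, here is a minimal self-contained sketch. It does not use jsoup's real Element class: ClassAttr is a hypothetical stand-in that only holds a class attribute string, and the token loop is a simplification rather than jsoup's actual implementation. The one-argument method delegates with false, so existing callers would keep the case-insensitive behaviour.

```java
// Hypothetical stand-in for an element; it only stores the class attribute.
final class ClassAttr {
    private final String classAttr;

    ClassAttr(String classAttr) {
        this.classAttr = classAttr == null ? "" : classAttr;
    }

    // Old signature: still case-insensitive, so existing callers are unaffected.
    boolean hasClass(String className) {
        return hasClass(className, false);
    }

    // Assumed new overload: the flag selects the comparison.
    boolean hasClass(String className, boolean caseSensitive) {
        if (className.isEmpty()) {
            return false;
        }
        for (String token : classAttr.trim().split("\\s+")) {
            boolean match = caseSensitive
                    ? token.equals(className)
                    : token.equalsIgnoreCase(className);
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ClassAttr div = new ClassAttr("c1");
        System.out.println(div.hasClass("C1"));       // true
        System.out.println(div.hasClass("C1", true)); // false
        System.out.println(div.hasClass("c1", true)); // true
    }
}
```

In a real change, every method between select() and hasClass() would need the same pair of overloads, which is the added complexity mentioned above.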
Also, as far as I know, HTML class names are case-sensitive, while other parts of a CSS selector, such as element names in HTML documents, are case-insensitive. My suggestion is that we should always write code as if case matters.
I have tried to fix this issue; my pull request is #1527. With it, you can select classes case-sensitively using .select(".classname", true) if you want.
Here's the code. Hope this helps you.
Great. Thanks :)
You're welcome! Just a reminder: you can also write it like this to determine automatically whether to use strict mode.
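import org.jsoup.nodes.DocumentType;

// documentType() returns null when the document has no doctype,
// so check for null instead of catching NullPointerException.
DocumentType docType = doc.documentType();
boolean htmlStrictMode = docType != null && docType.name().equals("html");
// select(String, boolean) is the overload from the pull request.
doc.select(".classname", htmlStrictMode);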