Element.hasClass ignores html strict mode

Open evaknin opened this issue 4 years ago • 7 comments

Hi, jsoup ignores case when matching class selectors. This happens regardless of whether the document uses strict HTML mode (<!DOCTYPE html>) or not. As a result, jsoup behaves differently from a browser when strict mode is in use.

For example:

    <!DOCTYPE html>
    <html>
    <head>
        <style type="text/css">
            .c1 { font-size: 44px; }
            .C1 { color: red; }
        </style>
    </head>
    <body>
    <div class="c1"> Some text </div>
    </body>
    </html>

The following fetches the div, even though the class on the div is written with a lowercase c: document.select(".C1");

My findings: the class evaluator's matches method calls Element.hasClass, and Element.hasClass checks for a match while ignoring case.

evaknin, Nov 02 '20 07:11

Excuse me. I have reproduced the behaviour by:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Repro {
    public static void main(String[] args) {
        // Same HTML as in the report: the div's class is lowercase "c1"
        String html = "<!DOCTYPE html>\n" +
                "<html>\n" +
                "<head>\n" +
                "    <style type=\"text/css\">\n" +
                "        .c1 {\n" +
                "            font-size: 44px;\n" +
                "        }\n" +
                "\n" +
                "        .C1 {\n" +
                "            color: #ffa578;\n" +
                "        }\n" +
                "    </style>\n" +
                "</head>\n" +
                "<body>\n" +
                "<div class=\"c1\">\n" +
                "    Some text\n" +
                "</div>\n" +
                "</body>\n" +
                "</html>";
        Document doc = Jsoup.parse(html);
        // Both attribute selectors match the div, whatever the case of "C1"
        System.out.println(doc.select("[class=C1]").get(0).text());
        System.out.println(doc.select("[class=c1]").get(0).text());
    }
}

Could you tell me how to use html strict mode so I can test and add some features for Jsoup.select()?

LIKP0, Apr 16 '21 08:04

Hi, in HTML5 we enable strict mode by adding <!DOCTYPE html> at the beginning of the HTML. If we remove it, strict mode is not used.

evaknin, Apr 16 '21 09:04

Hi, I think jsoup currently does not support case-sensitive select(), and this does not depend on whether the document is in HTML strict mode. From here you can see that selectors in jsoup are case-insensitive.

As a simple workaround, you could do a text replacement before calling select, so that the uppercase and lowercase names no longer collide, or you could apply your own case-sensitive check to the elements that select returns.
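
A minimal sketch of the second workaround, under some assumptions: the strings below stand in for the class attribute values of the elements a case-insensitive select(".C1") returned (in jsoup you would read them with Element.className()), and hasClassExact and filterExact are hypothetical helper names, not jsoup API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ExactClassFilter {
    // True only if classAttr contains className as a whole
    // whitespace-separated token with identical case.
    public static boolean hasClassExact(String classAttr, String className) {
        return Arrays.asList(classAttr.trim().split("\\s+")).contains(className);
    }

    // Keeps only the class attribute values that pass the exact-case check.
    public static List<String> filterExact(List<String> classAttrs, String className) {
        return classAttrs.stream()
                .filter(attr -> hasClassExact(attr, className))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-ins for the class attributes of the selected elements
        List<String> selected = Arrays.asList("c1", "C1 note", "box C1");
        System.out.println(filterExact(selected, "C1")); // [C1 note, box C1]
    }
}
```

This keeps jsoup itself untouched, at the cost of a second pass over the selected elements.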

There is no doubt that your findings are correct. In the source code of jsoup 1.13.1 (the latest version so far), if we change line 1374 of Element.java from "return className.equalsIgnoreCase(classAttr);" to "return className.equals(classAttr);", the problem in your example is solved: the div with class "c1" (lowercase c) is no longer selected by document.select(".C1");

If we want to solve this problem completely, we need to add a case-sensitivity option to the selectors in jsoup. Since Java does not support default parameters, and so as not to disturb existing functions, overloading the hasxxx methods seems a good solution.

For example:

public boolean hasClass(String className) {
    // Keep the existing case-insensitive behaviour as the default
    return this.hasClass(className, false);
}

public boolean hasClass(String className, boolean caseSensitive) {
    //some code here

    if (len == wantLen) {
        if (caseSensitive)
            return className.equals(classAttr);
        return className.equalsIgnoreCase(classAttr);
    }

    //some code here
}

But then a bunch of methods would need to be modified like this, since the methods are nested and the boolean value has to be passed from head to tail. This will make jsoup more complex, and I'm not sure whether it would have side effects.
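
To make the overload idea concrete, here is a self-contained sketch that works on a plain class attribute string instead of a jsoup Element. The ClassMatch class and its token loop are my own illustration, not the actual body of Element.hasClass; only the overload shape and the equals / equalsIgnoreCase switch follow the proposal above.

```java
public class ClassMatch {
    // Old signature: stays case-insensitive, so existing callers see no change.
    public static boolean hasClass(String classAttr, String className) {
        return hasClass(classAttr, className, false);
    }

    // New overload: the flag chooses between equals and equalsIgnoreCase.
    public static boolean hasClass(String classAttr, String className, boolean caseSensitive) {
        if (classAttr == null || className == null || className.isEmpty()) {
            return false;
        }
        // Compare against each whitespace-separated class name in the attribute
        for (String token : classAttr.trim().split("\\s+")) {
            boolean match = caseSensitive
                    ? token.equals(className)
                    : token.equalsIgnoreCase(className);
            if (match) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasClass("c1", "C1"));           // true
        System.out.println(hasClass("c1", "C1", true));     // false
        System.out.println(hasClass("box C1", "C1", true)); // true
    }
}
```

Every caller between select and hasClass would need the same kind of two-method pair to carry the flag through, which is the extra complexity mentioned above.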

Also, as far as I know, HTML class names are case-sensitive, while CSS selectors are generally case-insensitive. My suggestion is that we should always write code case sensitively.

RyderCRD, Apr 18 '21 02:04

I have tried to fix this issue; my pull request is #1527. With that change, you can select classes case-sensitively with .select(".classname", true) if you want.

RyderCRD, Apr 25 '21 10:04

Here's the code. Hope this helps you.

RyderCRD, Apr 25 '21 11:04

Great. Thanks :)

evaknin, Apr 25 '21 18:04

You're welcome! Just a reminder: you can also write it like this to determine automatically whether to use strict mode.

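        // documentType() returns null when the document has no doctype,
        // so check for null instead of catching NullPointerException
        boolean htmlStrictMode = doc.documentType() != null
                && doc.documentType().name().equals("html");
        // select(String, boolean) is the overload from pull request #1527
        doc.select(".classname", htmlStrictMode);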

RyderCRD, Apr 26 '21 02:04