`cssSelector` doesn't handle combining characters correctly

Open • samshutchins opened this issue 2 years ago • 1 comment

    @Test
    void combiningCharactersInIdentifier()
    {
        final String html = """
            <html>
            <head>
            <meta charset="utf-8">
            </head>
                        
            <body>
            <img class="e\u0301" src="/corner.jpg">
            </body>
                        
            </html>""";

        final Document document = Jsoup.parse(html);
        final Elements images = document.getElementsByTag("img");

        final Element img = images.get(0);
        final String cssSelector = img.cssSelector();

        assertEquals("html > body > img.e\u0301", cssSelector);
    }

The example above uses combining characters to create an é. Emoji make heavy use of combining characters (👨‍👨‍👧‍👧 is made up of 11 characters: \uD83D\uDC68\u200D\uD83D\uDC68\u200D\uD83D\uDC67\u200D\uD83D\uDC67).
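To make the "11 characters" figure concrete, here is a minimal sketch that counts the emoji's UTF-16 code units and its Unicode code points. The class and method names (`EmojiLength`, `codeUnits`, `codePoints`) are made up for illustration; only the string literal comes from the report above.

```java
class EmojiLength {
    // Family emoji from the report: man, ZWJ, man, ZWJ, girl, ZWJ, girl.
    static final String FAMILY =
        "\uD83D\uDC68\u200D\uD83D\uDC68\u200D\uD83D\uDC67\u200D\uD83D\uDC67";

    // Number of Java chars (UTF-16 code units) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; each surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(FAMILY));  // 11
        System.out.println(codePoints(FAMILY)); // 7
    }
}
```

The 11 code units break down into four surrogate pairs (8 units) plus three zero-width joiners (U+200D). Any escaping logic that walks the string one `char` at a time will therefore see pieces of a character rather than whole code points.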

I have seen emoji used as CSS class names in the wild, and I think the character-escaping code does the wrong thing when `cssSelector` is called: it looks like it escapes every character individually, which breaks these combining sequences.

samshutchins, Jul 25 '23 15:07

Current jsoup: `html > body > img.e\́`

Chrome: `body > p.e\\u0301`

I don't think it's incorrect to emit it as a run of characters, and the selector does work in jsoup. We could improve this by escaping the combining form as a \u escape sequence, like Chrome does.
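As a rough illustration of that idea, the sketch below rewrites combining marks in an identifier as CSS hex escapes (a backslash, the code point in hex, and a terminating space). This is not jsoup's implementation and is not claimed to match Chrome's exact output; `CombiningEscapeSketch` and `escapeCombiningMarks` are hypothetical names, and treating only the Unicode mark categories as "combining" is an assumption.

```java
class CombiningEscapeSketch {
    // Hypothetical helper, not part of jsoup: emits combining marks as
    // CSS hex escapes and copies every other code point unchanged.
    static String escapeCombiningMarks(String ident) {
        StringBuilder out = new StringBuilder();
        ident.codePoints().forEach(cp -> {
            int type = Character.getType(cp);
            // Assumption: "combining" means Unicode categories Mn, Mc, Me.
            boolean mark = type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
            if (mark) {
                // The trailing space terminates the hex escape in CSS.
                out.append('\\').append(Integer.toHexString(cp)).append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

With this sketch, `escapeCombiningMarks("e\u0301")` returns `e\301 ` (including the trailing space), so the class selector would read `img.e\301 `. Iterating by code point rather than by `char` also keeps surrogate pairs intact.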

jhy, Oct 20 '23 01:10