CSS identifier escapes are not supported

Open hoogenbj opened this issue 8 years ago • 4 comments

Hi, when trying to do a select on a document using an id containing a hyphen, I get the following error:

Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query '#ProductSummary-/productsummary/legalEntityId': unexpected token at '/productsummary/legalEntityId'
    at org.jsoup.select.QueryParser.findElements(QueryParser.java:198)
    at org.jsoup.select.QueryParser.parse(QueryParser.java:65)
    at org.jsoup.select.QueryParser.parse(QueryParser.java:39)
    at org.jsoup.select.Selector.<init>(Selector.java:86)
    at org.jsoup.select.Selector.select(Selector.java:108)
    at org.jsoup.nodes.Element.select(Element.java:296)

This seems to have been fixed before: see issue #15. I am using version 1.10.2. Using getElementById() works fine, though.

hoogenbj avatar Feb 23 '17 12:02 hoogenbj

It's not about the hyphen; the slashes cause this exception. You need a valid CSS selector. Yours is invalid because the slash is meant to be part of the id, but slashes are not allowed in identifiers: https://www.w3.org/TR/CSS2/syndata.html#value-def-identifier

In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".

Using getElementById() works only by chance, because it doesn't check the correctness of its argument.

krystiangorecki avatar Feb 23 '17 17:02 krystiangorecki

Thanks for your reply. It's not XPath, it's just an actual id in the document that I have to process. Escaping doesn't work either. I'll have to go to the document's author and see if they can change the ids...

On 23/02/2017 19:41, krystiangor wrote:

It's not about a hyphen. Slashes cause this exception. You need a correct css selector expression, not xpath. If you're sure your expression is correct try to escape slashes.

hoogenbj avatar Feb 24 '17 05:02 hoogenbj

Looks like the method to parse CSS identifiers is incomplete. https://github.com/jhy/jsoup/blob/f28c024ba127fd701f0d195a359afbabff04d7a1/src/main/java/org/jsoup/parser/TokenQueue.java#L365-L376

The linked version of the CSS specification (CSS2) contains this:

In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F". Note that Unicode is code-by-code equivalent to ISO 10646 (see [UNICODE] and [ISO10646]).

So the / character needs to be escaped in CSS identifiers. Unfortunately, the current code doesn't support that.
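For illustration, here is a minimal sketch of what escape-aware identifier parsing could look like. This is not jsoup's code: the class CssIdentifierReader and its methods are hypothetical names, and the logic only follows the CSS2 rules quoted above (a backslash followed by up to six hex digits and one optional whitespace character, or a backslash followed by any other character except a newline). Mapping zero, surrogate, and out-of-range code points to U+FFFD is an assumption borrowed from later CSS drafts, and "\r\n" after a hex escape is not treated as a single whitespace here.

```java
// Hypothetical helper, not part of jsoup: reads one CSS identifier and decodes escapes.
final class CssIdentifierReader {
    private final String input;
    private int pos;

    CssIdentifierReader(String input) {
        this.input = input;
    }

    /** Index of the first character that has not been consumed yet. */
    int position() {
        return pos;
    }

    /** Consumes identifier characters and escapes, returning the decoded identifier. */
    String consumeIdentifier() {
        StringBuilder out = new StringBuilder();
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (isIdentChar(c)) {
                out.append(c);
                pos++;
            } else if (c == '\\' && pos + 1 < input.length() && input.charAt(pos + 1) != '\n') {
                pos++; // skip the backslash
                consumeEscape(out);
            } else {
                break; // e.g. whitespace, a combinator, or an unescaped '/'
            }
        }
        return out.toString();
    }

    private void consumeEscape(StringBuilder out) {
        int start = pos;
        while (pos < input.length() && pos - start < 6 && isHex(input.charAt(pos))) {
            pos++;
        }
        if (pos == start) {
            // Not a hex escape: the escaped character stands for itself.
            int literal = input.codePointAt(pos);
            out.appendCodePoint(literal);
            pos += Character.charCount(literal);
            return;
        }
        int codePoint = Integer.parseInt(input.substring(start, pos), 16);
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint == 0 || surrogate || codePoint > Character.MAX_CODE_POINT) {
            codePoint = 0xFFFD; // assumption: replace invalid code points
        }
        out.appendCodePoint(codePoint);
        // One whitespace character after a hex escape terminates it and is swallowed.
        if (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
    }

    private static boolean isIdentChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
                || c == '-' || c == '_' || c >= 0x00A0;
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }
}
```

With a reader like this, the escaped id ProductSummary-\/productsummary\/legalEntityId would decode back to the id that is actually in the document, and both spellings from the spec example would decode to B&W?.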

cketti avatar Feb 24 '17 06:02 cketti

I'm not sure if it should be its own issue or not, but the incompleteness of consumeCssIdentifier() also causes Element.cssSelector() to fail if any ancestor nodes have an escaped or Unicode character. For example:

String html = "<html><body><div class=\"B\\&W\\?\"><div class=\"test\">Parsed HTML into a doc.</div></div></body></html>";
Jsoup.parse(html).select(".test").get(0).cssSelector();

Throws Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query 'div.B\&W\?': unexpected token at '\&W\?'

This happens because cssSelector() just builds invalid selectors along its chain and executes them: https://github.com/jhy/jsoup/blob/f28c024ba127fd701f0d195a359afbabff04d7a1/src/main/java/org/jsoup/nodes/Element.java#L532-L534
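On the generating side, class names and ids would need to be escaped before being joined into a selector. The following is only a sketch of that idea, loosely modelled on the "serialize an identifier" steps from CSSOM (what browsers expose as CSS.escape). CssIdentifierEscaper is a hypothetical name, not a jsoup class, and the sketch works on UTF-16 chars rather than full code points.

```java
// Hypothetical helper, not part of jsoup: turns an arbitrary id or class name
// into a CSS identifier that is safe to embed in a selector.
final class CssIdentifierEscaper {
    static String escape(String ident) {
        StringBuilder out = new StringBuilder();
        int length = ident.length();
        for (int i = 0; i < length; i++) {
            char c = ident.charAt(i);
            boolean leadingDigit = isDigit(c)
                    && (i == 0 || (i == 1 && ident.charAt(0) == '-'));
            if (c == 0) {
                out.append('\uFFFD'); // NULL cannot be represented
            } else if (c < 0x20 || c == 0x7F || leadingDigit) {
                // Control characters and leading digits become hex escapes
                // with a trailing space as terminator.
                out.append('\\').append(Integer.toHexString(c)).append(' ');
            } else if (c == '-' && length == 1) {
                out.append("\\-"); // a lone hyphen is not a valid identifier
            } else if (c >= 0x80 || c == '-' || c == '_' || isDigit(c) || isLetter(c)) {
                out.append(c); // allowed as-is
            } else {
                out.append('\\').append(c); // everything else: backslash + character
            }
        }
        return out.toString();
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }
}
```

Under these assumptions, a class named B&W? would be emitted as B\&W\? and the id from the original report as ProductSummary-\/productsummary\/legalEntityId. That only helps if the selector parser also understands escapes when it reads the selector back.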

DulithaRanatunga avatar Apr 06 '18 05:04 DulithaRanatunga

I'm not sure if it should be its own issue or not, but the incompleteness of consumeCssIdentifier() also causes Element.cssSelector() to fail if any ancestor nodes have an escaped or unicode character.

This works now with bc2181dd4be4e702d54edba8b498e64fc568cf96 and preceding commit. Thanks!

jhy avatar Jan 19 '23 04:01 jhy