jsoup
CSS identifier escapes are not supported
Hi,
when trying to do a select on a document using an id containing a hyphen, I get the following error:
Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query '#ProductSummary-/productsummary/legalEntityId': unexpected token at '/productsummary/legalEntityId'
at org.jsoup.select.QueryParser.findElements(QueryParser.java:198)
at org.jsoup.select.QueryParser.parse(QueryParser.java:65)
at org.jsoup.select.QueryParser.parse(QueryParser.java:39)
at org.jsoup.select.Selector.
It's not about a hyphen. Slashes cause this exception. You need a correct css selector. Your selector is invalid because slash is being consumed as a part of id, but slashes are not allowed in ids. https://www.w3.org/TR/CSS2/syndata.html#value-def-identifier
In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".
Using getElementById() works just by chance, because it doesn't check the correctness of its argument.
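To make the quoted rule concrete, here is a minimal sketch of a check for whether a string is usable as an unescaped CSS identifier. It follows the CSS2 identifier grammar without escape sequences; the class and method names are made up for illustration and are not part of jsoup.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: checks whether a string can be used
// as a CSS identifier WITHOUT any escaping, following the CSS2 rule quoted above.
final class CssIdentifierCheck {
    // Optional leading hyphen, then a start character (letter, underscore, or
    // U+00A0 and higher), then any number of name characters (which also allow
    // digits and hyphens).
    private static final Pattern PLAIN_IDENTIFIER = Pattern.compile(
            "-?[_a-zA-Z\\x{A0}-\\x{10FFFF}][-_a-zA-Z0-9\\x{A0}-\\x{10FFFF}]*");

    private CssIdentifierCheck() {}

    static boolean isPlainIdentifier(String s) {
        return PLAIN_IDENTIFIER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // The id from the report contains slashes, so it is not a plain identifier.
        System.out.println(isPlainIdentifier("ProductSummary-/productsummary/legalEntityId")); // false
        // The hyphen on its own is fine.
        System.out.println(isPlainIdentifier("ProductSummary-legalEntityId")); // true
    }
}
```

A string that fails this check can still be a valid HTML id; it just cannot appear verbatim after `#` in a selector.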
Thanks for your reply. It's not xpath, it's just an actual id in the document that I have to process. Escaping doesn't work either. I'll have to go to the document's author and see if they can change the ids...
On 23/02/2017 19:41, krystiangor wrote:
It's not about a hyphen. Slashes cause this exception. You need a correct css selector expression, not xpath. If you're sure your expression is correct try to escape slashes.
Looks like the method to parse CSS identifiers is incomplete. https://github.com/jhy/jsoup/blob/f28c024ba127fd701f0d195a359afbabff04d7a1/src/main/java/org/jsoup/parser/TokenQueue.java#L365-L376
The linked version of the CSS specification (CSS2) contains this:
In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F". Note that Unicode is code-by-code equivalent to ISO 10646 (see [UNICODE] and [ISO10646]).
So the / character needs to be escaped in CSS identifiers. Unfortunately, the current code doesn't support that.
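For reference, a rough sketch of what escaping an arbitrary id into a CSS identifier could look like. It is loosely modeled on the serialization rules behind the browser `CSS.escape` function; the class name is hypothetical and this is not jsoup code. As noted in this thread, the jsoup parser at the linked commit does not accept such escapes, so this only helps with a selector parser that does.

```java
// Hypothetical helper, not part of jsoup: escapes a raw id or class name so it
// can be written as a CSS identifier. Loosely follows the CSS.escape rules.
final class CssIdentifierEscaper {
    private CssIdentifierEscaper() {}

    static String escape(String ident) {
        StringBuilder out = new StringBuilder();
        int length = ident.length();
        for (int i = 0; i < length; ) {
            int cp = ident.codePointAt(i);
            boolean first = i == 0;
            boolean secondAfterHyphen = i == 1 && ident.charAt(0) == '-';
            if (cp == 0) {
                // NUL cannot be represented; use the replacement character.
                out.append('\uFFFD');
            } else if (cp <= 0x1F || cp == 0x7F
                    || (isDigit(cp) && (first || secondAfterHyphen))) {
                // Control characters and digits in a leading position need a
                // numeric escape: backslash, hex code point, then a space.
                out.append('\\').append(Integer.toHexString(cp)).append(' ');
            } else if (first && cp == '-' && length == 1) {
                // A lone hyphen is not a valid identifier by itself.
                out.append("\\-");
            } else if (cp >= 0x80 || cp == '-' || cp == '_'
                    || isDigit(cp) || isAsciiLetter(cp)) {
                // Allowed as-is.
                out.appendCodePoint(cp);
            } else {
                // Any other ASCII character: backslash followed by the character.
                out.append('\\').appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    private static boolean isDigit(int cp) {
        return cp >= '0' && cp <= '9';
    }

    private static boolean isAsciiLetter(int cp) {
        return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
    }

    public static void main(String[] args) {
        // prints: #ProductSummary-\/productsummary\/legalEntityId
        System.out.println("#" + escape("ProductSummary-/productsummary/legalEntityId"));
    }
}
```

With the identifier from the spec quote, `escape("B&W?")` yields `B\&W\?`, the first of the two forms the spec lists.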
I'm not sure if it should be its own issue or not, but the incompleteness of consumeCssIdentifier() also causes Element.cssSelector() to fail if any ancestor nodes have an escaped or unicode character. For example:
String html = "<html><body><div class=\"B\\&W\\?\"><div class=\"test\">Parsed HTML into a doc.</div></div></body></html>";
Jsoup.parse(html).select(".test").get(0).cssSelector();
Throws Exception in thread "main" org.jsoup.select.Selector$SelectorParseException: Could not parse query 'div.B\&W\?': unexpected token at '\&W\?'
This happens because cssSelector() just creates invalid selectors in its chain and executes them: https://github.com/jhy/jsoup/blob/f28c024ba127fd701f0d195a359afbabff04d7a1/src/main/java/org/jsoup/nodes/Element.java#L532-L534
I'm not sure if it should be its own issue or not, but the incompleteness of consumeCssIdentifier() also causes Element.cssSelector() to fail if any ancestor nodes have an escaped or unicode character.
This works now with bc2181dd4be4e702d54edba8b498e64fc568cf96 and preceding commit. Thanks!