
Handle non-standard CSS selectors

Open galaxyeye opened this issue 3 years ago • 1 comments

Some websites use selectors that do not conform to the standard. For example:

<div class='KAHaP+'></div>

The character "+" is not valid in a CSS class selector unless it is escaped, so Jsoup throws a SelectorParseException and pulsar-dom throws a PowerSelectorParseException.

We found this issue while handling jd.com and shopee.sg.

Jsoup follows the CSS2 definition of identifiers: https://www.w3.org/TR/CSS2/syndata.html#value-def-identifier

In CSS, identifiers (including element names, classes, and IDs in [selectors](https://www.w3.org/TR/CSS2/selector.html)) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".
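One way to work with such class names is to escape them before building a selector, following the rules quoted above. The following is only a minimal sketch: the class `CssIdentifiers` and its `escape` method are hypothetical names, not part of Jsoup or pulsar-dom, and whether a given Jsoup version accepts the escaped form is an assumption that needs to be verified.

```java
// Hypothetical helper, not part of Jsoup or pulsar-dom.
final class CssIdentifiers {
    private CssIdentifiers() {}

    /** Returns raw with the characters that CSS2 forbids in an identifier escaped. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            // A digit may not come first, nor directly after a leading hyphen.
            boolean leadingDigit = digit && (i == 0 || (i == 1 && raw.charAt(0) == '-'));
            // A lone "-" or a leading "--" is not a valid start either.
            boolean badLeadingHyphen = c == '-' && i == 0
                    && (raw.length() == 1 || raw.charAt(1) == '-');
            boolean control = c < 0x20 || (c >= 0x7F && c < 0xA0);

            if (leadingDigit || control) {
                // Numeric escape: backslash, hex code, terminating space.
                out.append('\\').append(Integer.toHexString(c)).append(' ');
            } else if (badLeadingHyphen) {
                out.append("\\-");
            } else if (letter || digit || c == '-' || c == '_' || c >= 0xA0) {
                // Allowed as-is: [a-zA-Z0-9], '-', '_', and U+00A0 and higher.
                out.append(c);
            } else {
                // Remaining ASCII characters such as '+', '&', '?' or a space.
                out.append('\\').append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, `CssIdentifiers.escape("KAHaP+")` yields `KAHaP\+`, and the spec's own example `B&W?` becomes `B\&W\?`. Because only the class name is escaped, a combinator written by the caller (for example the "+" in "div + p") is left untouched.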

For more on the valid characters in a CSS selector, see https://pineco.de/css-quick-tip-the-valid-characters-in-a-custom-css-selector/ An identifier in a selector looks something like this: -?[_a-zA-Z]+[_a-zA-Z0-9-]*
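A quick check of a class name against that pattern could look like the sketch below. The class and method names are hypothetical. The pattern is an ASCII-only approximation (it ignores escapes and characters at U+00A0 and higher), and the hyphen is placed last in the character class so that it is read as a literal hyphen rather than as a range.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or pulsar-dom.
final class CssIdentifierCheck {
    // ASCII-only approximation of a CSS identifier; escapes are not handled.
    private static final Pattern SIMPLE_IDENTIFIER =
            Pattern.compile("-?[_a-zA-Z]+[_a-zA-Z0-9-]*");

    private CssIdentifierCheck() {}

    /** Returns true if the whole name matches the simple identifier pattern. */
    static boolean isSimpleIdentifier(String name) {
        return SIMPLE_IDENTIFIER.matcher(name).matches();
    }
}
```

Under these assumptions, `isSimpleIdentifier("KAHaP")` is true while `isSimpleIdentifier("KAHaP+")` is false, so a name that fails the check would need escaping or special handling before it is passed to a selector parser.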

galaxyeye, May 05 '22 10:05

The last fix caused a new bug: it breaks the adjacent sibling selector (+), for example "div + p".

platonai, Aug 04 '22 09:08