mlscraper

Find better selectors

Open lorey opened this issue 3 years ago • 2 comments

Currently, we just use the first selector that works, searching from generic to specific. But overly generic selectors are bad, e.g. div most likely carries no meaning. On the other hand, overly specific selectors, like the full path, are brittle and will likely break when the page changes.

Maybe there's a heuristic for good selectors. One idea: compute the selectivity of each selector, i.e. how unique it is on the whole page. That would favor ids and unique classes and discourage generic selectors. We would then take the most selective yet simplest selector.
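A minimal sketch of that idea, assuming selectivity simply means "number of elements on the page that a selector matches". It uses only the standard library's html.parser and a few simple candidate selectors (tag, #id, .class, tag.class) instead of soupsieve. The names ElementCollector, candidate_selectors and best_selector are hypothetical and not part of mlscraper.

```python
from collections import Counter
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects (tag, id, classes) for every start tag on the page."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = tuple((attr.get("class") or "").split())
        self.elements.append((tag, attr.get("id"), classes))


def candidate_selectors(element):
    """Simple candidate selectors for one element (hypothetical candidate set)."""
    tag, id_, classes = element
    candidates = [tag]
    if id_:
        candidates.append(f"#{id_}")
    candidates += [f".{c}" for c in classes]
    candidates += [f"{tag}.{c}" for c in classes]
    return candidates


def best_selector(html, target_index):
    """Pick the candidate for the target element that matches the fewest elements."""
    collector = ElementCollector()
    collector.feed(html)
    # Count how many elements on the page each candidate selector matches.
    counts = Counter()
    for element in collector.elements:
        counts.update(set(candidate_selectors(element)))
    target = collector.elements[target_index]
    # Fewest matches first (most selective), then shortest as a simplicity proxy.
    return min(candidate_selectors(target), key=lambda s: (counts[s], len(s)))


PAGE = """
<div id="main" class="box">
  <div class="box"><p class="price">10</p></div>
  <div class="box"><p class="note">x</p></div>
</div>
"""

print(best_selector(PAGE, 0))  # outer div, which has a unique id
print(best_selector(PAGE, 2))  # the paragraph with class "price"
```

Selector length is only a rough stand-in for simplicity here, and a real implementation would also have to reject candidates that still match too many elements.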

lorey, Jun 24 '22 11:06

Converting a dynamic selector (one built from generated class names) to a substring-based [class*="..."] selector may work.

decentralizeCss(`h2.heading--2eONR.heading-2--1OnX8.title--3yncE.block--3v-Ow`)

// h2[class*="heading--"][class*="heading-2--"][class*="title--"][class*="block--"]

Of course, it depends on how this function is implemented. I tested it a few times and it worked.
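The implementation of decentralizeCss is not shown in this thread. A rough Python sketch of what such a function might do, assuming generated class names always have the form "stable-prefix--hash" as in the example above (decentralize_css and HASHED_CLASS are hypothetical names):

```python
import re

# Assumption: generated class names look like "<stable-prefix>--<hash>",
# e.g. "heading--2eONR". The lazy prefix stops at the first "--".
HASHED_CLASS = re.compile(r"\.([A-Za-z0-9_-]+?--)[A-Za-z0-9_-]+")


def decentralize_css(selector):
    """Replace each hashed class with a substring match on its stable prefix."""
    return HASHED_CLASS.sub(lambda m: f'[class*="{m.group(1)}"]', selector)


print(decentralize_css("h2.heading--2eONR.heading-2--1OnX8.title--3yncE.block--3v-Ow"))
```

Classes without a "--" separator are left untouched, so ordinary selectors such as div.price pass through unchanged.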

entrptaher, May 01 '23 08:05

I think that's computationally very expensive with the soupsieve implementation. For something Java-based with XSLT, it might make more sense.

lorey, May 01 '23 14:05