mlscraper
Find better selectors
Currently, we just use the next best selector we find, going from generic to specific. But overly generic selectors are bad, e.g. div most likely carries no meaning. On the other hand, overly specific selectors like the full path are tied too closely to the page structure and will likely break.
Maybe there's a heuristic for good selectors. An idea:
What if we compute a selectivity for each selector, i.e. how unique the selector is on the whole page? That would prefer ids and unique classes and discourage generic selectors. We then take the most selective but simplest selector.
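A minimal sketch of that idea, not mlscraper's actual implementation: selectivity is assumed here to be 1 divided by the number of matching elements, and ties are broken by selector length as a stand-in for "simplest". The names ElementCollector, matches, selectivity and best_selector are hypothetical, and only the simple forms tag, #id and .class are handled. A real version would more likely count matches with BeautifulSoup's soup.select(selector).

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects (tag, id, classes) for every start tag on the page."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Valueless attributes arrive as None, hence the "or".
        classes = frozenset((attributes.get("class") or "").split())
        self.elements.append((tag, attributes.get("id"), classes))


def matches(selector, element):
    """Supports only the simple forms 'tag', '#id' and '.class'."""
    tag, element_id, classes = element
    if selector.startswith("#"):
        return element_id == selector[1:]
    if selector.startswith("."):
        return selector[1:] in classes
    return tag == selector


def selectivity(selector, elements):
    """1.0 for exactly one match, lower for more matches, 0.0 for none."""
    count = sum(matches(selector, element) for element in elements)
    return 1.0 / count if count else 0.0


def best_selector(candidates, elements):
    # Highest selectivity wins; ties go to the shorter (simpler) selector.
    return max(candidates, key=lambda s: (selectivity(s, elements), -len(s)))


page = """
<div id="main">
  <div class="item"><h2 class="title">A</h2></div>
  <div class="item"><h2 class="title">B</h2></div>
</div>
"""
collector = ElementCollector()
collector.feed(page)
print(best_selector(["div", ".item", "#main"], collector.elements))
```

In practice the candidates would be the selectors that match the target element, so the ranking only decides which of several valid selectors to keep.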
Converting a selector with dynamically generated class names into one based on *= substring matching may work.
decentralizeCss(`h2.heading--2eONR.heading-2--1OnX8.title--3yncE.block--3v-Ow`)
// h2[class*="heading--"][class*="heading-2--"][class*="title--"][class*="block--"]
Of course, it depends on the implementation of this function; I tested it a few times and it worked.
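For reference, a rough Python equivalent of such a function could look like the sketch below. It is not the implementation used above: the name decentralize_css and the "--" separator default are assumptions based on the example, and everything after the first "--" in a class name is treated as a generated hash.

```python
import re

# A class token is a dot followed by letters, digits, underscores or hyphens.
CLASS_TOKEN = re.compile(r"\.([A-Za-z0-9_-]+)")


def decentralize_css(selector: str, separator: str = "--") -> str:
    """Replace hashed classes like .title--3yncE with [class*="title--"]."""

    def replace(match):
        prefix, sep, _suffix = match.group(1).partition(separator)
        if not sep or not prefix:
            # No separator (or nothing before it): keep the class as it is.
            return match.group(0)
        return f'[class*="{prefix}{sep}"]'

    return CLASS_TOKEN.sub(replace, selector)


print(decentralize_css("h2.heading--2eONR.heading-2--1OnX8.title--3yncE.block--3v-Ow"))
```

Note that this also rewrites hand-written BEM modifier classes such as .button--primary, so the result can match more elements than intended.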
I think with the soupsieve implementation, that's computationally very expensive. For something Java-based with XSLT, it might make more sense.