lambdasoup
Selector list
Hi,
I started using Lambda Soup and found that it does not seem to support selector lists, such as ".bg1, .bg3".
I need to parse an HTML document containing various <div> elements with the classes bg2, bg1, bgbc, and bg3, and I want to keep only the bg1 and bg3 ones while preserving their order.
I am wondering if it would be easy to implement this feature?
Yes, it should be fairly straightforward. One would have to:
- Extend the grammar of selectors with one more level: https://github.com/aantron/lambda-soup/blob/8084d5b86ce8f1223271fc1e67398ac618dacbda/src/soup.ml#L489
  `simple_selector` is stuff like `.class-foo`, `[attribute-bar]`; combinators are `>`, `+`, etc. So, this grammar is capable of representing things like `.class-foo > [attribute-bar]`. It needs one more level of `list` to be able to represent comma-separated lists of these.
- This is the parser top-level function. It needs to be modified to become not the top-level function, but a parser for a single item delimited by `,`, and then a new top-level function needs to wrap it, one that reads commas and calls the current parser for reading everything in between. https://github.com/aantron/lambda-soup/blob/8084d5b86ce8f1223271fc1e67398ac618dacbda/src/soup.ml#L896-L913
- This is the select code. Its logic needs to be wrapped in a new top-level loop that tries additional selectors from the new top-level `list` if the preceding ones didn't yield a match. https://github.com/aantron/lambda-soup/blob/8084d5b86ce8f1223271fc1e67398ac618dacbda/src/soup.ml#L611-L647
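
To make the three steps concrete, here is a rough, self-contained sketch on a toy model. It is not the code in `soup.ml`: the types `element`, `selector`, and `selector_list` and the functions `parse_single`, `parse_list`, `matches_single`, `matches_list`, and `select` are all hypothetical stand-ins, and the only selector it understands is `.class`. It only illustrates the shape of the change: a list level on top of the grammar, a comma-reading wrapper around the existing parser, and a matching loop that falls through to the next selector when the preceding one fails.

```ocaml
(* Toy model, NOT Lambda Soup's real types: an element is just its classes. *)
type element = { classes : string list }

(* Hypothetical stand-in for one parsed selector; only ".class" is modeled. *)
type selector = Class of string

(* Step 1: one more grammar level, a comma-separated list of selectors. *)
type selector_list = selector list

(* Stand-in for the existing top-level parser, demoted to parse one item. *)
let parse_single (s : string) : selector =
  let s = String.trim s in
  if String.length s > 1 && s.[0] = '.' then
    Class (String.sub s 1 (String.length s - 1))
  else
    failwith ("unsupported selector in this sketch: " ^ s)

(* Step 2: new top-level parser that reads commas and delegates each piece.
   Assumption: no commas occur inside an item (e.g. in quoted attribute
   values); a real implementation would handle that in the tokenizer. *)
let parse_list (s : string) : selector_list =
  String.split_on_char ',' s |> List.map parse_single

let matches_single (Class c) (e : element) : bool = List.mem c e.classes

(* Step 3: try each selector in turn; List.exists stops at the first match,
   so later selectors are only tried if the preceding ones did not match. *)
let matches_list (sels : selector_list) (e : element) : bool =
  List.exists (fun sel -> matches_single sel e) sels

(* Filtering a list that is already in document order preserves that order. *)
let select (query : string) (elements : element list) : element list =
  let sels = parse_list query in
  List.filter (matches_list sels) elements

let () =
  let doc =
    [ { classes = [ "bg2" ] }; { classes = [ "bg1" ] };
      { classes = [ "bgbc" ] }; { classes = [ "bg3" ] } ]
  in
  select ".bg1, .bg3" doc
  |> List.iter (fun e -> print_endline (String.concat " " e.classes))
```

On the example from the question, this prints `bg1` and then `bg3`: each element is tested once against the whole list, so matches come out in document order rather than grouped by selector.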
Thanks, I'll take a look ASAP.