Add a custom :contains-regexp() pseudo class?

Open facelessuser opened this issue 6 years ago • 5 comments

This is currently open as an exploratory idea. This would be a custom pseudo-class that allows regular expression searches of content. The idea would probably not be to include regular expressions directly in the selector, but most likely references to compiled patterns:

import re
import soupsieve as sv

pattern = re.compile(r'some .*? pattern')
regexp = {'content_pattern': pattern}
sv.compile('p:-regexp(content_pattern)', regexp=regexp)

Do we make this like :contains() and have it search all descendants of p for the pattern, or do we constrain it to the text of the target element p itself? Or do we have two variants, one that searches all descendants and one that searches only the target: :-regexp() and :-regexp-direct() (or some other name that gets the idea across)?
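
To make the two candidate semantics concrete, here is a minimal sketch that emulates them today by post-filtering ordinary soupsieve results. The sample HTML and the `own_text` helper are made up for illustration, and nothing here is a proposed soupsieve API.

```python
import re

import soupsieve as sv
from bs4 import BeautifulSoup

html = """
<div>
  <p id="a">some direct pattern</p>
  <p id="b">intro <span>some nested pattern</span></p>
  <p id="c">nothing to see</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"some .*? pattern")


def own_text(el):
    """Join only the text nodes that are direct children of el (hypothetical helper)."""
    return "".join(el.find_all(string=True, recursive=False))


candidates = sv.select("p", soup)

# Variant 1: like :contains(), search the text of the element and all its descendants.
all_text_ids = [el["id"] for el in candidates if pattern.search(el.get_text())]

# Variant 2: search only the text that belongs directly to the element.
direct_ids = [el["id"] for el in candidates if pattern.search(own_text(el))]
```

With this input, the descendant variant keeps both `a` and `b`, while the direct variant keeps only `a`, because the matching text in `b` lives inside the nested span.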

Anyways this is just an idea, but maybe in the future (if we flesh this out enough), we can implement this.

facelessuser avatar Feb 22 '19 05:02 facelessuser

It's important to note that Beautiful Soup already provides regex searching, so we don't strictly need this, but it might be nice to incorporate regex into selectors in some way as well. We just need to decide whether we are willing to commit to a solution, and what that solution should look like.
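
For reference, a small sketch of the regex support Beautiful Soup already has through `find_all()`. The sample markup is invented, and Beautiful Soup applies the compiled patterns with search semantics.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<p data-item="150">test-abc1</p>'
    '<p data-item="250">test-xyz</p>'
    '<p data-item="101">other</p>'
)
soup = BeautifulSoup(html, "html.parser")

# <p> tags whose single text node matches the pattern.
by_text = soup.find_all("p", string=re.compile(r"test-[a-z\d]+", re.I))

# <p> tags whose data-item attribute value matches the pattern.
by_attr = soup.find_all("p", attrs={"data-item": re.compile(r"1[0-9]{2}")})
```

This covers simple cases, but it cannot be combined with selector features such as combinators or pseudo-classes, which is the gap the idea in this issue would fill.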

facelessuser avatar Feb 23 '19 02:02 facelessuser

If we do this, a name like :contains-regexp() might be more descriptive and make more sense.

When defining regex keywords, should we require them to be in the form of custom CSS variables: --regex-key? As far as I know, we will never really have a need for regex variables in our scheme. Maybe we should require some other kind of variable prefix $key 🤷‍♂️ .

Or we could extend custom maybe? If you give a regex pattern instead of selector string, it searches a tag's content? Just some ideas.

facelessuser avatar Feb 25 '19 15:02 facelessuser

Thinking about this more, we really could use custom selectors to do regex. Currently we take a string for a given custom pseudo-class, but we could accept a custom pseudo-class object as well. The object could take a selector plus a text search value (a regex or a string). You could even extend it to allow attribute values as well.

So, just thinking out loud here, and assuming custom is a hashable object:

import soupsieve as sv
import re

custom = {
    ':--custom-pseudo': sv.CustomPseudo(
        'p.class',
        text=re.compile(r'test-[a-z\d]+', re.I),
        attr={'data-item': re.compile(r'1[0-9]{2}')}
    )
}

sv.compile('article div > :--custom-pseudo', custom=custom)

It may even be possible to allow a custom function, but I'm not sure yet. As long as things remain hashable and pickle-able, it would be doable, but I imagine sending in a function may not always behave properly, because compiled patterns get cached. Caching a pattern that holds a function does not guarantee you'd get the same behavior. I think I'd pass on functions for now.
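
A quick standard-library check of the hashable and pickle-able concern. It only shows how compiled patterns and a lambda behave with `hash` and `pickle`; the cache dictionary is a stand-in and says nothing about how soupsieve caches internally.

```python
import pickle
import re

pattern = re.compile(r"test-[a-z\d]+", re.I)

# Compiled patterns are hashable, so they can be part of a cache key.
cache = {("p.class", pattern): "compiled selector placeholder"}

# A pattern built from the same source and flags finds the same cache entry.
same = re.compile(r"test-[a-z\d]+", re.I)
pattern_is_cacheable = ("p.class", same) in cache

# Compiled patterns also survive a pickle round trip.
restored = pickle.loads(pickle.dumps(pattern))
pattern_round_trips = (
    restored.pattern == pattern.pattern and restored.flags == pattern.flags
)

# A lambda is hashable, but the standard pickle module cannot serialize it.
try:
    pickle.dumps(lambda el: True)
    function_pickles = True
except (pickle.PicklingError, AttributeError):
    function_pickles = False
```

Regex objects satisfy both requirements, while an arbitrary function fails the pickle one, which is consistent with leaving functions out for now.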

facelessuser avatar Feb 25 '19 16:02 facelessuser

Another possibility is to extend :contains() and the attribute equality case to accept custom template variables: $var.

You would define regular expressions under custom variable names, where a name could be any valid identifier, and reference them in the selector with a $ prefix.

regexp = {
    'content-pattern': re.compile(r'test-[a-z\d]+', re.I),
    'attr-pattern': re.compile(r'1[0-9]{2}')
}

sv.compile('p:contains($content-pattern)[data-item=$attr-pattern]', regexp=regexp)

Maybe this is the most straightforward approach? If nothing else, it is another option. Custom patterns may still need a way to provide regex when defining them.
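
Until something like this exists, roughly the same result can be had by selecting first and filtering afterwards. The sketch below assumes a hypothetical helper, `select_with_regex`, and assumes search (not full match) semantics for both the text and the attribute patterns; neither is part of soupsieve.

```python
import re

import soupsieve as sv
from bs4 import BeautifulSoup


def select_with_regex(selector, tag, text=None, attrs=None):
    """Hypothetical helper: select with soupsieve, then filter with compiled patterns."""
    attrs = attrs or {}
    results = []
    for el in sv.select(selector, tag):
        # Skip elements whose full text (including descendants) does not match.
        if text is not None and not text.search(el.get_text()):
            continue
        # Every listed attribute must be present as a single string and match.
        if all(
            isinstance(el.get(name), str) and rx.search(el[name])
            for name, rx in attrs.items()
        ):
            results.append(el)
    return results


regexp = {
    "content-pattern": re.compile(r"test-[a-z\d]+", re.I),
    "attr-pattern": re.compile(r"1[0-9]{2}"),
}

soup = BeautifulSoup(
    '<p data-item="123">TEST-a1</p>'
    '<p data-item="23">test-b2</p>'
    '<p data-item="199">nope</p>',
    "html.parser",
)

matches = select_with_regex(
    "p",
    soup,
    text=regexp["content-pattern"],
    attrs={"data-item": regexp["attr-pattern"]},
)
```

Only the first paragraph passes both filters here. The obvious drawback is that the filtering happens outside the selector, so it cannot take part in combinators or in :not() and :is().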

facelessuser avatar Feb 26 '19 14:02 facelessuser

If we end up doing #175, this would not be needed.

facelessuser avatar Jan 26 '20 22:01 facelessuser