
Applying same options to multiple URLs

Open jwilk opened this issue 1 year ago • 2 comments

I'm watching a large number of URLs that have the same structure, so I'm applying the same set of filters to them:

url: https://example.net/9566
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'
---
url: https://example.net/14026
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'
---
url: https://example.net/15829
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'
---
# ...

This is very tiresome to update.

So I wish I could write something like this instead:

url:
- https://example.net/9566
- https://example.net/14026
- https://example.net/15829
# ...
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'

jwilk, Sep 06 '22 21:09

One quick and pragmatic way to do this would be to write a small script that generates urls.yaml from a "template" like the one you specified above. This way, you can make it as complex and/or powerful as you want.
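A minimal sketch of such a generator, assuming the shared options are kept as a block of YAML text and the URLs as a plain list. The script itself, `render_jobs`, `SHARED_OPTIONS` and `URLS` are hypothetical names for illustration, not part of urlwatch:

```python
# Hypothetical generator script; none of these names come from urlwatch.

# Options shared by every job, kept verbatim as YAML text.
SHARED_OPTIONS = """\
ssl_no_verify: true
filter:
- css: '#data'
- xpath: '//text()'
- format-json: null
- grep: '"(foo|bar)":'
- re.sub: 'T00:00:00(?=")'
- re.sub: '"'
- re.sub: '(?m)^ *|,$'
"""

# The only part that differs between jobs.
URLS = [
    "https://example.net/9566",
    "https://example.net/14026",
    "https://example.net/15829",
]


def render_jobs(urls, shared_options):
    """Return a multi-document YAML string with one job per URL."""
    documents = [f"url: {url}\n{shared_options}" for url in urls]
    # Jobs in the job list are separated by a '---' line, as in the issue.
    return "---\n".join(documents)


if __name__ == "__main__":
    # Write the job list to stdout so it can be redirected into urls.yaml.
    print(render_jobs(URLS, SHARED_OPTIONS), end="")
```

Redirecting the script's output into urls.yaml reproduces the repeated job list from the issue, so a filter change only has to be made once in `SHARED_OPTIONS`.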

Your suggestion doesn't work well if, e.g., you want to give different URLs different names. On the other hand, for the simple case of turning the "url" field into a list, it could work. The job parser would need to be updated to handle that, though: it would probably "expand" the data so that the rest of the codebase "sees" distinct jobs that just happen to have the same filter configuration.
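The "expand" step could look roughly like the sketch below. It assumes a job has already been loaded into a plain dict; `expand_job` is a hypothetical helper for illustration, not the actual urlwatch job parser:

```python
import copy


def expand_job(job):
    """Expand a job dict whose 'url' is a list into one job dict per URL."""
    urls = job.get("url")
    if not isinstance(urls, list):
        # Ordinary job (single URL, or no URL at all): leave it untouched.
        return [job]
    expanded = []
    for url in urls:
        # Deep copy so every job gets its own filter list, not a shared one.
        single = copy.deepcopy(job)
        single["url"] = url
        expanded.append(single)
    return expanded


if __name__ == "__main__":
    job = {
        "url": ["https://example.net/9566", "https://example.net/14026"],
        "ssl_no_verify": True,
        "filter": [{"css": "#data"}, {"xpath": "//text()"}],
    }
    for single in expand_job(job):
        print(single["url"], single["filter"])
```

Because every other key is copied unchanged, per-URL settings such as distinct names still can't be expressed this way, which is the limitation mentioned above.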

Keeping this open for now as a feature idea for the future.

thp, Sep 08 '22 08:09

Perhaps this will help @jwilk or others searching for a solution.

I generally use the global job_defaults to apply the same filters to all my URLs (see the urlwatch docs). If I have a few URLs that need an additional filter step, or a few URLs that need a certain filter step skipped, I use a custom SelectiveFilter. This is obviously just for my use case, but perhaps the idea can be generalized.

This custom SelectiveFilter allows you to define a list of regex patterns to match. A conventional filter that you specify is then applied or skipped depending on the result of that match.

Not exactly what you want, but I think the concept of making a custom filter in your hooks.py and giving that filter some logic to either apply itself or not is a workable solution.
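The apply-or-skip logic can be sketched on its own, independent of urlwatch. This is not the SelectiveFilter mentioned above; `apply_selectively` and its parameters are hypothetical, and matching the patterns against the job's URL is an assumption made for this example. Wiring it into a filter class in hooks.py is left out here:

```python
import re


def apply_selectively(data, url, patterns, wrapped_filter, apply_on_match=True):
    """Run wrapped_filter on data only if url matches (or fails to match) a pattern.

    With apply_on_match=True the filter runs for matching URLs;
    with apply_on_match=False it is skipped for them and runs for all others.
    """
    matched = any(re.search(pattern, url) for pattern in patterns)
    if matched == apply_on_match:
        return wrapped_filter(data)
    return data


def strip_quotes(text):
    """Stand-in for a conventional filter step (here: remove double quotes)."""
    return text.replace('"', "")


if __name__ == "__main__":
    patterns = [r"example\.net/9\d+$"]
    # The URL matches the pattern, so the wrapped filter is applied.
    print(apply_selectively('"foo": 1', "https://example.net/9566", patterns, strip_quotes))
    # The URL does not match, so the data passes through unchanged.
    print(apply_selectively('"foo": 1', "https://example.net/14026", patterns, strip_quotes))
```

Setting `apply_on_match=False` inverts the choice, which covers the "skip this step for a few URLs" case.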

trevorshannon, Feb 05 '24 04:02