urlwatch
html2text option in job_defaults -> url only applied to first job
Hi there,
I am seeing a rather strange phenomenon, though granted, my setup is probably a bit substandard.
I have the following config:
display:
  error: true
  new: true
  unchanged: false
job_defaults:
  all:
    headers:
      User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287
  browser: {}
  shell: {}
  url:
    filter:
      - element-by-tag: body
      - html2text:
          method: lynx
      - re.sub:
          pattern: '(?i)(ist jetzt )(..:..)( Uhr)'
          repl: '\1XX:XX\3'
      - re.sub:
          pattern: '(?i)(Es ist: )(..-..-...., ..:..)'
          repl: '\1XX-XX-XXXX, XX:XX'
      - strip
report:
  ...
and my jobs are just:
# A basic URL job just needs a URL
name: "Site1"
url: "https://site1..."
---
name: "Site2"
url: "https://site2..."
---
name: "Site3"
url: "https://site3..."
---
name: "Site4"
url: "https://site4..."
---
name: "Site5"
url: "https://site5..."
---
name: "Site6"
url: "https://site6..."
---
Now I saw strange behaviour and finally found out that "method: lynx" is only applied to the first job that gets filtered. Running urlwatch -v confirms this:
:~$ grep html2text urlwatch.log
2020-11-04 11:45:01,550 handler INFO: Processing: <url url='https://site1.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site1' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,553 handler INFO: Processing: <url url='https://site2.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site2' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,555 handler INFO: Processing: <url url='https://site3.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site3' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,555 handler INFO: Processing: <url url='https://site4.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site4' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,556 handler INFO: Processing: <url url='https://site5.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site5' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,558 handler INFO: Processing: <url url='https://site6.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site6' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,725 filters INFO: Applying filter 'html2text', subfilter {'method': 'lynx'} to https://site5.../
2020-11-04 11:45:01,747 worker DEBUG: Job finished: <url url='https://site5.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site5' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,747 worker DEBUG: Using max_tries of 0 for <url url='https://site5.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site5' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,048 filters INFO: Applying filter 'html2text', subfilter {} to https://site2.../
2020-11-04 11:45:02,050 worker DEBUG: Job finished: <url url='https://site2.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site2' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,050 worker DEBUG: Using max_tries of 0 for <url url='https://site2.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site2' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,511 filters INFO: Applying filter 'html2text', subfilter {} to https://site3.../
2020-11-04 11:45:02,512 worker DEBUG: Job finished: <url url='https://site3.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site3' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,513 worker DEBUG: Using max_tries of 0 for <url url='https://site3.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site3' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,520 filters INFO: Applying filter 'html2text', subfilter {} to https://site1.../
2020-11-04 11:45:02,556 worker DEBUG: Job finished: <url url='https://site1.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site1' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,557 worker DEBUG: Using max_tries of 0 for <url url='https://site1.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site1' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:03,916 filters INFO: Applying filter 'html2text', subfilter {} to https://site6.../
2020-11-04 11:45:03,917 worker DEBUG: Job finished: <url url='https://site6.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site6' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:03,917 worker DEBUG: Using max_tries of 0 for <url url='https://site6.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site6' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:06,429 filters INFO: Applying filter 'html2text', subfilter {} to https://site4.../
2020-11-04 11:45:06,431 worker DEBUG: Job finished: <url url='https://site4.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site4' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:06,431 worker DEBUG: Using max_tries of 0 for <url url='https://site4.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site4' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
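So what happens is that it starts with Site5 (?), uses the correct method there, and then forgets about the option for every later job. The log shows subfilter {'method': 'lynx'} for Site5 but subfilter {} for all the others, and the "Job finished" lines show {'html2text': {}} for every job, including Site5.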
I am using urlwatch 2.21.
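To illustrate what I suspect is going on, here is a minimal sketch in plain Python. It is not urlwatch's code: build_jobs, apply_html2text and run are hypothetical names, and the assumption is that every job ends up referencing the same filter objects from job_defaults and that the html2text step removes the 'method' key from its option dict while reading it. Under that assumption, the first job to be filtered sees 'lynx' and every later job sees an empty dict, which is the pattern in the log above.

```python
import copy


def build_jobs(names, defaults, share):
    """Build hypothetical job dicts from a default filter list.

    With share=True every job references the very same filter objects
    (the suspected situation); with share=False each job gets its own copy.
    """
    return [
        {"name": name, "filter": defaults if share else copy.deepcopy(defaults)}
        for name in names
    ]


def apply_html2text(subfilter):
    """Hypothetical filter step that consumes its 'method' option destructively."""
    return subfilter.pop("method", "default")


def run(jobs):
    """Return (job name, html2text method actually used) in processing order."""
    used = []
    for job in jobs:
        for entry in job["filter"]:
            if isinstance(entry, dict) and "html2text" in entry:
                used.append((job["name"], apply_html2text(entry["html2text"])))
    return used


if __name__ == "__main__":
    shared = [{"element-by-tag": "body"}, {"html2text": {"method": "lynx"}}, "strip"]
    print(run(build_jobs(["Site5", "Site2", "Site3"], shared, share=True)))

    copied = [{"element-by-tag": "body"}, {"html2text": {"method": "lynx"}}, "strip"]
    print(run(build_jobs(["Site5", "Site2", "Site3"], copied, share=False)))
```

In the shared case only Site5 gets 'lynx', while in the copied case all three jobs do. If the real cause is something like this, giving each job its own copy of the defaults, or reading the option without removing it, would probably avoid the problem.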