
html2text option in job_defaults -> url only applied to first job

Open kongomongo opened this issue 5 years ago • 0 comments

Hi there,

I am seeing a rather strange phenomenon; granted, my setup is probably a bit nonstandard.

I have the following config:

display:
  error: true
  new: true
  unchanged: false
job_defaults:
  all:
    headers:
      User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287
  browser: {}
  shell: {}
  url:
    filter:
      - element-by-tag: body
      - html2text:
          method: lynx
      - re.sub:
          pattern: '(?i)(ist jetzt )(..:..)( Uhr)'
          repl: '\1XX:XX\3'
      - re.sub:
          pattern: '(?i)(Es ist: )(..-..-...., ..:..)'
          repl: '\1XX-XX-XXXX, XX:XX'
      - strip
report:
...

and my jobs are just:

# A basic URL job just needs a URL
name: "Site1"
url: "https://site1..."
---
name: "Site2"
url: "https://site2..."
---
name: "Site3"
url: "https://site3..."
---
name: "Site4"
url: "https://site4..."
---
name: "Site5"
url: "https://site5..."
---
name: "Site6"
url: "https://site6..."
---

I noticed strange behaviour and finally found out that "method: lynx" is only applied to the first job that gets processed. Running urlwatch -v confirms this:

:~$ grep html2text urlwatch.log
2020-11-04 11:45:01,550 handler INFO: Processing: <url url='https://site1.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site1' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,553 handler INFO: Processing: <url url='https://site2.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site2' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,555 handler INFO: Processing: <url url='https://site3.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site3' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,555 handler INFO: Processing: <url url='https://site4.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site4' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,556 handler INFO: Processing: <url url='https://site5.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site5' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,558 handler INFO: Processing: <url url='https://site6.../' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site6' filter=[{'element-by-tag': 'body'}, {'html2text': {'method': 'lynx'}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,725 filters INFO: Applying filter 'html2text', subfilter {'method': 'lynx'} to https://site5.../
2020-11-04 11:45:01,747 worker DEBUG: Job finished: <url url='https://site5.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site5' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:01,747 worker DEBUG: Using max_tries of 0 for <url url='https://site5.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site5' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,048 filters INFO: Applying filter 'html2text', subfilter {} to https://site2.../
2020-11-04 11:45:02,050 worker DEBUG: Job finished: <url url='https://site2.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site2' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,050 worker DEBUG: Using max_tries of 0 for <url url='https://site2.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site2' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,511 filters INFO: Applying filter 'html2text', subfilter {} to https://site3.../
2020-11-04 11:45:02,512 worker DEBUG: Job finished: <url url='https://site3.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site3' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,513 worker DEBUG: Using max_tries of 0 for <url url='https://site3.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site3' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,520 filters INFO: Applying filter 'html2text', subfilter {} to https://site1.../
2020-11-04 11:45:02,556 worker DEBUG: Job finished: <url url='https://site1.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site1' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:02,557 worker DEBUG: Using max_tries of 0 for <url url='https://site1.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site1' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:03,916 filters INFO: Applying filter 'html2text', subfilter {} to https://site6.../
2020-11-04 11:45:03,917 worker DEBUG: Job finished: <url url='https://site6.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site6' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:03,917 worker DEBUG: Using max_tries of 0 for <url url='https://site6.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site6' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:06,429 filters INFO: Applying filter 'html2text', subfilter {} to https://site4.../
2020-11-04 11:45:06,431 worker DEBUG: Job finished: <url url='https://site4.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site4' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>
2020-11-04 11:45:06,431 worker DEBUG: Using max_tries of 0 for <url url='https://site4.../' method='GET' headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36 OPR/71.0.3770.287'} name='Site4' filter=[{'element-by-tag': 'body'}, {'html2text': {}}, {'re.sub': {'pattern': '(?i)(ist jetzt )(..:..)( Uhr)', 'repl': '\\1XX:XX\\3'}}, {'re.sub': {'pattern': '(?i)(Es ist: )(..-..-...., ..:..)', 'repl': '\\1XX-XX-XXXX, XX:XX'}}, 'strip']>

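So what happens is that it starts with Site5 (?), uses the correct method there, and then forgets the option for all remaining jobs: every later html2text call gets an empty subfilter ({}) instead of {'method': 'lynx'}.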
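One possible explanation, offered as a guess and not a confirmed diagnosis: the "Job finished" line for Site5 already shows 'html2text': {} right after the filter ran, which would fit a filter that removes 'method' from a subfilter dict that all jobs share via job_defaults. The sketch below reproduces that pattern in plain Python. All names in it (default_filters, make_job, apply_html2text, run) are hypothetical and are not taken from the urlwatch source.

```python
import copy


def default_filters():
    """Hypothetical stand-in for the job_defaults -> url -> filter list."""
    return [{"element-by-tag": "body"}, {"html2text": {"method": "lynx"}}, "strip"]


def make_job(name, defaults, deep=False):
    """Build a job; with deep=False all jobs share the same subfilter dicts."""
    filters = copy.deepcopy(defaults) if deep else list(defaults)
    return {"name": name, "filter": filters}


def apply_html2text(subfilter):
    """Hypothetical filter that reads 'method' with pop(), mutating its argument."""
    return subfilter.pop("method", "default")


def run(jobs):
    """Return the html2text method each job ends up using, in processing order."""
    used = {}
    for job in jobs:
        for entry in job["filter"]:
            if isinstance(entry, dict) and "html2text" in entry:
                used[job["name"]] = apply_html2text(entry["html2text"])
    return used


names = ("Site5", "Site2", "Site3")

# Shared subfilter dicts: only the first processed job still sees 'method'.
defaults = default_filters()
shared = run([make_job(n, defaults) for n in names])

# Deep-copied defaults per job: every job keeps its own 'method'.
defaults = default_filters()
isolated = run([make_job(n, defaults, deep=True) for n in names])

print(shared)
print(isolated)
```

If this is indeed the cause, copying the defaults per job (or reading the option without mutating the dict) would avoid it; until then, repeating the filter list in each job instead of job_defaults might work as a workaround, though I have not verified that.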

I am using urlwatch 2.21.

kongomongo, Nov 04 '20 11:11