urlwatch Pdf2TextFilter error handling

Open JulienPalard opened this issue 2 years ago • 2 comments

I wanted to use urlwatch like this:

---
url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202205.pdf
filter:
  - pdf2text
---
url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202206.pdf
filter:
  - pdf2text
---
url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf
filter:
  - pdf2text

to wait for the next bulletins, but when a bulletin is not available yet, meteofrance.fr redirects to an HTML page, which leads to an exception:

Traceback (most recent call last):
  File "urlwatch/handler.py", line 120, in process
    data = FilterBase.process(filter_kind, subfilter, self, data)
  File "urlwatch/filters.py", line 188, in process
    return filtercls(state.job, state).filter(data, subfilter)
  File "urlwatch/filters.py", line 399, in filter
    return '\n\n'.join(pdftotext.PDF(io.BytesIO(data), password=subfilter.get('password', '')))
pdftotext.Error: poppler error creating document

Maybe the pdf2text filter needs some kind of error handling option. What do you think?
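One possible shape for such an option, as a minimal sketch only: check that the payload starts with the PDF header before handing it to the converter, and return a fallback otherwise. The names `looks_like_pdf`, `pdf2text_or_skip`, and the `fallback` parameter are hypothetical and not part of urlwatch; the converter is passed in as a callable so the sketch does not depend on pdftotext being installed.

```python
PDF_MAGIC = b"%PDF-"  # the header a PDF file is expected to start with


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload begins with the PDF header."""
    return data.startswith(PDF_MAGIC)


def pdf2text_or_skip(data: bytes, convert, fallback: str = "") -> str:
    """Run `convert` only on PDF-looking data (hypothetical helper).

    `convert` is any callable taking bytes and returning text, for
    example a thin wrapper around pdftotext.PDF as used in the traceback.
    """
    if not looks_like_pdf(data):
        # An HTML placeholder page ends up here instead of raising.
        return fallback
    return convert(data)
```

With something like this, the HTML "not available" page would yield the fallback text rather than a `pdftotext.Error`. Whether an empty result or a distinct error is the better behaviour is a design choice left open here.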

Apr 13 '22 20:04 JulienPalard

Hm, it seems like the page returns a "proper" 404 status code when the PDF isn't yet available.

I do wonder why the filter is run in this case. Have you maybe set things up to ignore 404 errors?

Apr 18 '22 09:04 thp

No, they don't return a proper 404. In fact, it really depends on the request (`curl -I` sends a HEAD request, `curl -i` a GET):

$ curl -I https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf
HTTP/1.1 404 Not Found
...

$ curl -i https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf
HTTP/1.1 302 Found
Date: Tue, 19 Apr 2022 20:08:54 GMT
Location: https://donneespubliques.meteofrance.fr/?fond=donnee_indisponible
...

And in urlwatch, the job's retrieve function hits the second case:

(Pdb) p response
<Response [200]>
(Pdb) p response.url
'https://donneespubliques.meteofrance.fr/?fond=donnee_indisponible'
(Pdb) p response.history
[<Response [302]>]
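The debugger output above suggests one way a caller could notice this situation: the final response is a 200, but it has a non-empty redirect history and no longer points at a PDF. A rough sketch, assuming the placeholder page is served with a non-PDF `Content-Type` (the thread does not show that header); `redirected_away_from_pdf` is a hypothetical helper, not an urlwatch function. It only relies on the `history` and `headers` attributes that a requests `Response` provides.

```python
def redirected_away_from_pdf(response) -> bool:
    """Return True if the response was redirected to something that is not a PDF.

    `response` is expected to behave like requests.Response:
    `history` lists the intermediate redirect responses and
    `headers` holds the final response headers.
    """
    content_type = response.headers.get("Content-Type", "")
    was_redirected = bool(response.history)
    is_pdf = content_type.lower().startswith("application/pdf")
    return was_redirected and not is_pdf
```

A job could treat a `True` result as "bulletin not published yet" and skip the pdf2text filter for that run.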

Apr 19 '22 20:04 JulienPalard
