urlwatch
Pdf2TextFilter error handling
I wanted to use urlwatch like this:
---
url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202205.pdf
filter:
- pdf2text
---
url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202206.pdf
filter:
- pdf2text
---
url: https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf
filter:
- pdf2text
to wait for the next bulletins, but when a bulletin is not available yet, meteofrance.fr redirects to an HTML page, which leads to an exception:
Traceback (most recent call last):
  File "urlwatch/handler.py", line 120, in process
    data = FilterBase.process(filter_kind, subfilter, self, data)
  File "urlwatch/filters.py", line 188, in process
    return filtercls(state.job, state).filter(data, subfilter)
  File "urlwatch/filters.py", line 399, in filter
    return '\n\n'.join(pdftotext.PDF(io.BytesIO(data), password=subfilter.get('password', '')))
pdftotext.Error: poppler error creating document
Maybe the pdf2text filter needs some kind of error-handling option. What do you think?
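To make the idea concrete, here is a minimal sketch of what such an option could do: check that the data looks like a PDF before handing it to the converter. This is not urlwatch's actual code; `looks_like_pdf`, `pdf_to_text_safe`, and the `fallback` parameter are hypothetical names, and `extract` stands in for whatever does the real conversion (for example a wrapper around `pdftotext.PDF`).

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears near the start of the data."""
    # PDF files start with "%PDF-"; looking in the first 1024 bytes
    # tolerates a little leading junk.
    return b"%PDF-" in data[:1024]


def pdf_to_text_safe(data: bytes, extract, fallback: str = "") -> str:
    """Hypothetical helper: run `extract` only when the data looks like a PDF.

    `extract` is any callable taking the raw bytes and returning text.
    `fallback` is returned when the data is not a PDF (e.g. an HTML page).
    """
    if not looks_like_pdf(data):
        return fallback
    return extract(data)
```

With something like this, an HTML "not available" page would produce the fallback text instead of a traceback.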
Hm, it seems like the page returns a "proper" 404 status code when the PDF isn't yet available.
I do wonder why the filter is run in this case. Have you maybe set things up to ignore 404 errors?
No, they don't return a proper 404. In fact, it depends on the request: a HEAD request (curl -I) gets a 404, while a GET request (curl -i) gets a 302 redirect:
$ curl -I https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf
HTTP/1.1 404 Not Found
...
$ curl -i https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf
HTTP/1.1 302 Found
Date: Tue, 19 Apr 2022 20:08:54 GMT
Location: https://donneespubliques.meteofrance.fr/?fond=donnee_indisponible
...
In urlwatch, the job's retrieve function hits the second case. The redirect is followed, so the final response is a 200:
(Pdb) p response
<Response [200]>
(Pdb) p response.url
'https://donneespubliques.meteofrance.fr/?fond=donnee_indisponible'
(Pdb) p response.history
[<Response [302]>]
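The pdb session above suggests one way to detect this situation: a non-empty `response.history` combined with a final `response.url` that differs from the requested one. The sketch below is only an illustration, not something urlwatch does; `redirected_away` is a hypothetical helper, and the fake response is a stand-in exposing the same `url` and `history` attributes that a requests `Response` has.

```python
from types import SimpleNamespace


def redirected_away(response, requested_url: str) -> bool:
    """Hypothetical check: was the request redirected to a different URL?"""
    # `history` holds the intermediate redirect responses, if any.
    return bool(response.history) and response.url != requested_url


# Stand-in object mirroring the values seen in the pdb session.
requested = "https://donneespubliques.meteofrance.fr/donnees_libres/bulletins/BCM/202207.pdf"
fake_response = SimpleNamespace(
    url="https://donneespubliques.meteofrance.fr/?fond=donnee_indisponible",
    history=[SimpleNamespace(status_code=302)],
)

if redirected_away(fake_response, requested):
    print("Redirected to", fake_response.url)
```

A job could treat such a redirect as "not available yet" and skip the pdf2text filter for that run.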