OVG SN: skip links without PDF and fix PDF links
For SN OVG intermediate pages, skip links where the a element does not enclose text.
Fixes https://github.com/niklaswais/gesp/issues/11
I fixed more failing cases by skipping links by URL pattern instead, and by removing text that doesn't belong to the link.
The latest commit makes the fix even more robust by scanning for the filename pattern in both link text and href attribute (see issue for motivation).
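For illustration, here is a minimal sketch of the "check both link text and href" idea. It is not the code from the commit: the regular expression, the function names `find_document_name` and `select_links`, and the sample links are all assumptions made up for this example, and the real filename pattern used by the scraper may differ.

```python
import re

# Hypothetical filename pattern: a name ending in .pdf or .docx.
# The actual pattern in the scraper may be stricter.
FILENAME_RE = re.compile(r"[\w.\-]+\.(?:pdf|docx)\b", re.IGNORECASE)


def find_document_name(link_text, href):
    """Return the first filename-like match, checking link text before href."""
    for candidate in (link_text or "", href or ""):
        match = FILENAME_RE.search(candidate)
        if match:
            return match.group(0)
    return None


def select_links(links):
    """Keep (filename, href) pairs for links that look like documents."""
    selected = []
    for text, href in links:
        name = find_document_name(text, href)
        if name is not None:
            selected.append((name, href))
    return selected


# Made-up sample links: (link text, href)
sample = [
    ("", "/a/b.html"),                      # no filename anywhere: skipped
    ("Urteil 3_B_1_21.pdf", "/doc?id=5"),   # filename only in the link text
    (None, "/files/x.DOCX"),                # filename only in the href
]
print(select_links(sample))
```

Checking the link text first means stray surrounding text (such as "Urteil" above) is dropped and only the filename is kept, while the href acts as a fallback when the `a` element encloses no text.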
To avoid missing files that were previously downloaded, I also collect links to DOCX files again. Some identifiers, like ExportAsPdfPipeline or save_as_pdf, might warrant a rename now. Alternatively, this might call for adapting the architecture to better cope with cases where one scraper can output different file formats.
Thank you! I'm in the process of rewriting the architecture and will add it as a new branch for testing in the upcoming weeks.