gesp OVG SN: skip links without PDF and fix PDF links

For SN OVG intermediate pages, skip links where the a element does not enclose text.

Fixes https://github.com/niklaswais/gesp/issues/11

Sep 22 '24 10:09 zeuner

I fixed further failing cases by skipping links based on their URL pattern instead, and by removing text that doesn't belong to the link.

Sep 22 '24 17:09 zeuner

The latest commit makes the fix even more robust by scanning for the filename pattern in both link text and href attribute (see issue for motivation).
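A minimal sketch of the idea, not the code from the commit: it collects every `a` element and keeps a link if a PDF-like file name shows up in either its text or its `href`. The names `FILE_PATTERN`, `LinkCollector` and `find_pdf_links` are hypothetical, the regular expression is an assumed stand-in for the real filename pattern, and the standard-library `html.parser` is used here only to keep the example self-contained.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: any file name ending in ".pdf" (the real pattern may be stricter).
FILE_PATTERN = re.compile(r"[\w.\-]+\.pdf\b", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_pdf_links(html):
    """Return (href, filename) for links whose text or href matches FILE_PATTERN."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    found = []
    for href, text in collector.links:
        # Check the link text first, then fall back to the href attribute.
        match = FILE_PATTERN.search(text) or FILE_PATTERN.search(href)
        if match:
            found.append((href, match.group(0)))
    return found
```

Checking both places means a link with empty text but a matching `href` is still kept, while a link matching in neither place is skipped.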

Sep 23 '24 06:09 zeuner

To avoid missing files that were previously downloaded, I also collect links to DOCX files again. Some identifiers, such as ExportAsPdfPipeline or save_as_pdf, could use a rename now. Alternatively, this might call for adapting the architecture to better cope with cases where one scraper can output different file formats.
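One way to picture a format-neutral code path is a helper that derives the file type from the link itself, so PDF and DOCX links go through the same logic. This is only a sketch under assumptions: `SUPPORTED_EXTENSIONS` and `document_extension` are hypothetical names and are not part of gesp.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed set of formats the scraper may emit.
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}


def document_extension(url):
    """Return the lower-cased extension of a supported document URL, else None."""
    # Only the path component is inspected, so query strings are ignored.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix if suffix in SUPPORTED_EXTENSIONS else None
```

A pipeline could then name the saved file after whatever extension is returned instead of assuming PDF.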

Sep 23 '24 22:09 zeuner

Thank you! I'm in the process of rewriting the architecture. I will add it as a new branch for testing in the upcoming weeks.

Nov 24 '24 15:11 niklaswais