
OVG SN: skip links without PDF and fix PDF links

Open zeuner opened this issue 1 year ago • 3 comments

For SN OVG intermediate pages, skip links whose a element does not enclose any text.

Fixes https://github.com/niklaswais/gesp/issues/11

zeuner, Sep 22 '24 10:09

I fixed more of the failing cases by skipping links based on their URL pattern instead, and by removing text that doesn't belong to the link.

zeuner, Sep 22 '24 17:09

The latest commit makes the fix even more robust by scanning for the filename pattern in both link text and href attribute (see issue for motivation).
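
As a rough illustration of the idea (not the code from the commit), checking both the link text and the href for a filename could look like the sketch below. It uses only the Python standard library rather than the scraper's own selectors, and `FILENAME_RE`, `LinkCollector` and `find_filenames` are hypothetical names; the real filename pattern may well differ.

```python
import re
from html.parser import HTMLParser

# Hypothetical filename pattern; the pattern used by the scraper may differ.
FILENAME_RE = re.compile(r"[\w.\-]+\.pdf", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # An <a> without href is recorded with an empty href.
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_filenames(html):
    """Return filenames found in the link text or, failing that, in the href."""
    parser = LinkCollector()
    parser.feed(html)
    found = []
    for href, text in parser.links:
        match = FILENAME_RE.search(text) or FILENAME_RE.search(href)
        if match:
            # Keep only the filename, dropping surrounding text.
            found.append(match.group(0))
    return found
```

Links where neither the text nor the href contains a matching filename are simply skipped, which also covers a elements that enclose no text at all.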

zeuner, Sep 23 '24 06:09

To avoid missing files that were previously downloaded, I also collect links to DOCX files again. Some identifiers, like ExportAsPdfPipeline or save_as_pdf, could use a rename now. On the other hand, this might call for adapting the architecture to better cope with cases where one scraper can output different file formats.
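
One possible direction, sketched under the assumption that the output format can be derived from the link itself: keep the extension of the source file instead of hard-coding PDF. The names `SUPPORTED_EXTENSIONS`, `document_extension` and `target_filename` are made up for this example and do not exist in gesp.

```python
import os
from urllib.parse import urlparse

# Formats the scraper is assumed to accept; adjust as needed.
SUPPORTED_EXTENSIONS = {".pdf", ".docx"}


def document_extension(url):
    """Return the lowercased extension of the URL path if supported, else None."""
    path = urlparse(url).path  # drops any query string or fragment
    ext = os.path.splitext(path)[1].lower()
    return ext if ext in SUPPORTED_EXTENSIONS else None


def target_filename(court, identifier, url):
    """Build a hypothetical output name that keeps the source file's format."""
    ext = document_extension(url)
    if ext is None:
        return None  # not a document link, so the caller can skip it
    return f"{court}_{identifier}{ext}"
```

With something like this, a format-neutral pipeline name would fit better than one that mentions PDF.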

zeuner, Sep 23 '24 22:09

Thank you! I'm in the process of rewriting the architecture. I will add it as a new branch for testing in the upcoming weeks.

niklaswais, Nov 24 '24 15:11