python-scraperlib
Collection of Python code to re-use across Python-based scrapers
Currently, we rely on various objects in scraperlib to:

- create the ZIM
- re-encode videos and images
- cache these assets on the optimization cache

We might consider to...
As discussed in https://github.com/openzim/sotoki/pull/162#issuecomment-660452579, it actually seems a bit odd to handle duplicate files in the scrapers. We could instead have a system of redirects so that we keep a single copy of...
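A rough sketch of that idea, deduplicating payloads by content digest and registering later occurrences as redirects to the first stored copy. The `add_redirect()` and `add_item_for()` helper names on the Creator are assumptions here:

```python
import hashlib
from pathlib import Path

seen: dict[str, str] = {}  # content digest -> ZIM path of the stored copy


def add_file_once(creator, path: str, fpath: Path) -> None:
    """Store a file only once; duplicates become redirects to the first copy."""
    digest = hashlib.sha256(fpath.read_bytes()).hexdigest()
    if digest in seen:
        # duplicate payload: only record an alias pointing at the existing entry
        creator.add_redirect(path, seen[digest])
        return
    seen[digest] = path
    creator.add_item_for(path=path, fpath=fpath)
```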
[PEP585](https://peps.python.org/pep-0585/) introduced support for the generics syntax in all standard collections currently available in the `typing` module. At the same time, it deprecated the use of `typing` for all these...
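For illustration, the new spellings next to the deprecated `typing` ones:

```python
from typing import Dict, List  # deprecated spellings since PEP 585


def lengths_old(names: List[str]) -> Dict[str, int]:
    return {name: len(name) for name in names}


def lengths_new(names: list[str]) -> dict[str, int]:  # built-in generics, Python 3.9+
    return {name: len(name) for name in names}
```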
It offers better compression
Testing WebP support on youtube showed that `WebpHigh` doesn't produce *high*-quality thumbnails. As these image presets are going to be used everywhere, it's important that, now that the rest...
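To get a feel for which quality setting `WebpHigh` should map to, a quick comparison with Pillow (the file name and quality values are just examples):

```python
from pathlib import Path

from PIL import Image  # Pillow

src = Path("thumbnail.png")  # hypothetical source thumbnail
image = Image.open(src)
for quality in (50, 60, 80, 90):
    dst = src.with_suffix(f".q{quality}.webp")
    # method=6 asks the WebP encoder for its slowest / best compression effort
    image.save(dst, format="WEBP", quality=quality, method=6)
    print(f"quality={quality}: {dst.stat().st_size} bytes")
```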
As we're seeing SVGs as source images in scraper sources, it'd be good to have an SVG optimizer/cleaner. Probably lossless only at this point.

- [svgo](https://github.com/svg/svgo) probably most popular (node)
- ...
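Pending a decision on the tool, a minimal wrapper shelling out to svgo (which has to be installed separately) could look like the sketch below; a pure-Python option such as scour could be wrapped the same way:

```python
import subprocess
from pathlib import Path


def optimize_svg(src: Path, dst: Path) -> None:
    """Losslessly clean an SVG by delegating to the svgo CLI."""
    subprocess.run(
        ["svgo", "--input", str(src), "--output", str(dst)],
        check=True,  # raise if svgo is missing or fails
    )
```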
We use kiwix_storagelib to implement the S3-based optimization cache in the scrapers. However, this gives rise to redundant code. We put a version of the file along with the optimizer...
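The pattern each scraper currently repeats looks roughly like the sketch below; the method names on the storage client and the `optimizer_version` tag are assumptions here:

```python
from pathlib import Path

OPTIMIZER_VERSION = "v1"  # hypothetical cache-busting tag


def get_optimized(s3_storage, key: str, fpath: Path, optimize) -> None:
    """Reuse an optimized file from the S3 cache, or optimize and upload it."""
    if s3_storage.has_object_matching_meta(
        key, tag="optimizer_version", value=OPTIMIZER_VERSION
    ):
        # cache hit: the stored copy was produced by the same optimizer version
        s3_storage.download_file(key, fpath)
        return
    optimize(fpath)  # cache miss: optimize locally, then populate the cache
    s3_storage.upload_file(fpath, key, meta={"optimizer_version": OPTIMIZER_VERSION})
```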
We have a helper delete_callback at https://github.com/openzim/python-scraperlib/blob/335d5271e106b374f1aca871d19557ff2c81582d/src/zimscraperlib/filesystem.py#L47 This delete_callback is meant to be used as a callback when adding an item to the ZIM, typically to delete the original file once...
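For reference, the typical usage pattern in a scraper looks roughly like this; the Creator import path, its constructor arguments and the `callback=` keyword are assumptions here:

```python
from pathlib import Path

from zimscraperlib.filesystem import delete_callback
from zimscraperlib.zim.creator import Creator  # import path assumed

fpath = Path("video.webm")  # hypothetical temporary asset
with Creator("output.zim", "index.html") as creator:  # arguments assumed
    creator.add_item_for(
        path="videos/video.webm",
        fpath=fpath,
        # invoked once libzim has consumed the item, so the
        # temporary file doesn't linger on disk
        callback=(delete_callback, fpath),
    )
```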
Since 4.0.0, it looks like automatic indexing of PDFs has made the scraperlib significantly slower to process items. It is probably linked to the fact that, with the current 4.0.0 implementation...
The README should mention that libcairo is mandatory for SVG operations.