python-scraperlib
Collection of Python code to re-use across Python-based scrapers
All scrapers set ZIM tags from a user-provided string of semicolon-separated values (or at least they should). Some scrapers also set a few tags automatically, in...
In warc2zim, we have tooling which confirms that:

- the output folder exists
- the output folder is writable
- the ZIM name is acceptable to the current filesystem

This is run at...
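As a rough illustration of the three checks listed above, here is a minimal sketch using only the standard library. The function name `check_zim_output` and the choice of exceptions are hypothetical; this is not the warc2zim implementation. It assumes that creating a throwaway directory and file is an acceptable way to probe the filesystem.

```python
import tempfile
from pathlib import Path


def check_zim_output(output_dir, zim_name):
    """Raise if a ZIM named `zim_name` could not be created in `output_dir`.

    Hypothetical helper, not the warc2zim code.
    """
    output_dir = Path(output_dir)

    # 1. The output folder must exist (and be a directory).
    if not output_dir.is_dir():
        raise NotADirectoryError(f"{output_dir} is not an existing directory")

    # 2. The output folder must be writable: probe by creating a scratch
    #    directory inside it. Any OSError is reported as "not writable",
    #    which is a simplification.
    try:
        scratch = tempfile.TemporaryDirectory(dir=output_dir)
    except OSError as exc:
        raise PermissionError(f"{output_dir} is not writable") from exc

    # 3. The ZIM name must be acceptable to the filesystem: try to create a
    #    file with that name inside the scratch directory, so that an
    #    existing file of the same name in output_dir is never touched.
    with scratch as scratch_dir:
        try:
            (Path(scratch_dir) / zim_name).touch()
        except OSError as exc:
            raise ValueError(f"{zim_name!r} is not a valid file name here") from exc
```

Probing inside a scratch directory means the check leaves nothing behind once it returns, at the cost of a few extra filesystem operations.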
There are a few things not working right yet, and this isn't passing pre-commit checks yet, so I'm pushing this just for advice at this time.
Our image functions all assume bitmap input or output. Should we support vector formats as well (SVG's the only open and widespread one I think)? If so, should it be...
[`reencode()`](https://github.com/openzim/python-scraperlib/blob/main/src/zimscraperlib/video/encoding.py#L40) uses a temporary file to encode into. Once ffmpeg completes, if all went well, that temp file is copied to the destination.

- Copying is very safe (considering we could...
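The temp-file-then-copy pattern described here can be sketched generically as follows. This does not reproduce `reencode()`: the helper name `produce_via_temp` and the `producer` callback (standing in for the ffmpeg run) are assumptions made for illustration only.

```python
import shutil
import tempfile
from pathlib import Path


def produce_via_temp(producer, dst):
    """Run `producer(tmp_path)`; copy the result to `dst` only on success.

    `producer` is a hypothetical callable standing in for the encoder; it
    must write its output to the given path and return True on success.
    """
    dst = Path(dst)
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Keep the destination suffix so tools that infer the container
        # format from the extension behave the same on the temp file.
        tmp_path = Path(tmp_dir) / f"output.tmp{dst.suffix}"
        if not producer(tmp_path):
            # Failure: the destination is left untouched.
            return False
        shutil.copyfile(tmp_path, dst)
    return True
```

The destination is only written once the producer has finished successfully, so a failed run never leaves a partial file at `dst`.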
The specification explicitly says that we must validate the number of characters (it looks like graphemes would be an even more correct term). Currently scraperlib is using the `len` function which...
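To make the gap concrete: `len` counts Unicode code points, so a letter followed by a combining accent counts as two. The sketch below is only an approximation that skips combining marks using the standard library; it is not full grapheme segmentation (for example, it does not merge emoji joined with zero-width joiners). The function name `approx_char_count` is hypothetical.

```python
import unicodedata


def approx_char_count(text):
    """Count code points that are not combining marks.

    Approximation only: real grapheme clusters follow Unicode UAX #29,
    which the standard library does not implement.
    """
    return sum(1 for char in text if unicodedata.combining(char) == 0)


decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
print(len(decomposed))                # 2: two code points
print(approx_char_count(decomposed))  # 1: one user-perceived character
```

An exact count would need a UAX #29 implementation, for instance the `\X` pattern of the third-party `regex` package.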
We should probably consider automatically dropping any control characters found in string metadata (title, description, ...). Or raise an error...
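One possible shape for this, assuming "control characters" means the Unicode general category `Cc` (which includes newline and tab). The function name and the `strict` flag are hypothetical and only illustrate the two options mentioned (drop or raise).

```python
import unicodedata


def clean_metadata(value, strict=False):
    """Drop control characters (category Cc) from a metadata string.

    With strict=True, raise ValueError instead of silently dropping.
    Hypothetical helper; note that "\n" and "\t" are also category Cc.
    """
    cleaned = "".join(
        char for char in value if unicodedata.category(char) != "Cc"
    )
    if strict and cleaned != value:
        raise ValueError("metadata contains control characters")
    return cleaned
```

For example, `clean_metadata("Ti\x00tle\n")` returns `"Title"`, while the same call with `strict=True` raises `ValueError`.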
As discussed in https://github.com/openzim/warc2zim/issues/123, we would benefit from logging the metadata which are used, at least all text values. Regarding illustration, do we want to log the base64 value? It...
When computing the list of tags, it could help to deduplicate them, so that they are not "doubled" by mistake. https://github.com/openzim/ted/blob/60fb82a127b371907c8d24ba70b4e50d29ff5005/src/ted2zim/scraper.py#L93
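A minimal sketch of such a deduplication, assuming tags arrive as a single semicolon-separated string and that the order of first appearance should be kept. The function name `parse_tags` is hypothetical.

```python
def parse_tags(raw):
    """Split a semicolon-separated tag string and drop duplicates.

    Surrounding whitespace and empty entries are discarded; the order of
    first appearance is preserved (dict keys keep insertion order).
    """
    stripped = (tag.strip() for tag in raw.split(";"))
    return list(dict.fromkeys(tag for tag in stripped if tag))


print(parse_tags("wikipedia;_category:foo; wikipedia;;"))
# ['wikipedia', '_category:foo']
```

Using `dict.fromkeys` rather than `set` keeps the output order stable, which makes the resulting metadata reproducible between runs.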
https://github.com/openzim/python-scraperlib/blob/7d498319baadba715316c15cf9857ff2f6974a00/README.md?plain=1#L60

1. My Debian install uses the "externally managed" python3 install option, so it seems that I need `pipx` rather than `pip`. Neither was installed and I had to find...