python-scraperlib

Collection of Python code to re-use across Python-based scrapers

Results 52 python-scraperlib issues

All scrapers set ZIM tags based on a user-provided string with a semicolon separator between values (or at least they should). Some scrapers also set a few tags automatically, in...

enhancement
good first issue

In warc2zim, we have tooling which confirms that:
- the output folder exists
- the output folder is writable
- the ZIM name is acceptable to the current filesystem

This is run at...

enhancement
good first issue
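The three checks listed in this issue can be sketched as below. This is a minimal illustration, not warc2zim's actual implementation: the function name `validate_zim_destination` and its error types are hypothetical, and the filename check simply tries to create (then remove) a file with the requested name.

```python
import pathlib
import tempfile


def validate_zim_destination(output_dir: pathlib.Path, zim_name: str) -> None:
    """Hypothetical helper: raise if the ZIM file cannot be created as requested."""
    output_dir = pathlib.Path(output_dir)

    # 1. The output folder must exist (and be a directory).
    if not output_dir.is_dir():
        raise NotADirectoryError(f"{output_dir} is not an existing directory")

    # 2. The output folder must be writable: try creating a throwaway file in it.
    try:
        with tempfile.NamedTemporaryFile(dir=output_dir):
            pass
    except OSError as exc:
        raise PermissionError(f"{output_dir} is not writable") from exc

    # 3. The ZIM name must be accepted by the filesystem: try creating it.
    target = output_dir / zim_name
    try:
        existed = target.exists()
        target.touch()
        if not existed:
            # Only clean up files this check created itself.
            target.unlink()
    except (OSError, ValueError) as exc:
        raise ValueError(
            f"{zim_name!r} is not a valid file name in {output_dir}"
        ) from exc
```

Running such a check at startup lets a scraper fail before any long-running work begins, rather than at the moment the ZIM file is written.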

There are a few things not working right yet, and this isn't passing pre-commit checks yet, so I'm pushing this just for advice at this time.

Our image functions all assume bitmap input or output. Should we support vector formats as well (SVG's the only open and widespread one I think)? If so, should it be...

question

[`reencode()`](https://github.com/openzim/python-scraperlib/blob/main/src/zimscraperlib/video/encoding.py#L40) uses a temporary file to encode into. Once ffmpeg completes, if all went well, that temp file is copied to the destination.
- Copying is very safe (considering we could...

enhancement

The specification explicitly says that we must validate the number of characters (it looks like graphemes would be an even more correct term). Currently scraperlib is using the `len` function which...

bug
question
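To illustrate why `len` and a character count can disagree: `len` on a Python `str` counts Unicode code points, so a letter followed by a combining accent counts as two. The sketch below uses only the standard library and is an approximation, not a full grapheme segmentation: the hypothetical `approx_grapheme_count` just skips combining marks and does not handle cases such as emoji joined with zero-width joiners. A complete implementation of Unicode text segmentation would need a third-party library.

```python
import unicodedata


def approx_grapheme_count(text: str) -> int:
    """Hypothetical helper: count code points that are not combining marks.

    Approximation only: Unicode categories Mn, Mc and Me are treated as
    attaching to the previous character. Emoji ZWJ sequences, flags, etc.
    are NOT handled.
    """
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))


# "é" written as "e" + COMBINING ACUTE ACCENT (U+0301)
decomposed = "e\u0301"
code_points = len(decomposed)  # counts code points
visible = approx_grapheme_count(decomposed)  # counts base characters
```

Here `code_points` is 2 while `visible` is 1, which is the kind of gap a length validation based on `len` can run into.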

We should probably consider automatically dropping any control characters found in string metadata (title, description, ...). Or raising an error...

enhancement
question
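Both options raised here (drop silently, or raise) can be sketched with the standard library. This is only an illustration under stated assumptions: the name `clean_metadata` and its `strict` flag are hypothetical, and "control character" is taken to mean Unicode category `Cc`, which also includes tab and newline.

```python
import unicodedata


def clean_metadata(value: str, *, strict: bool = False) -> str:
    """Hypothetical helper: remove control characters from a metadata string.

    With strict=True, raise instead of silently dropping them.
    Assumes "control character" means Unicode category Cc (this includes
    tab, newline and carriage return).
    """
    controls = [ch for ch in value if unicodedata.category(ch) == "Cc"]
    if controls and strict:
        found = ", ".join(f"U+{ord(ch):04X}" for ch in controls)
        raise ValueError(f"control characters found in metadata: {found}")
    return "".join(ch for ch in value if unicodedata.category(ch) != "Cc")
```

Whether newlines should be preserved in longer fields is a separate decision that this sketch does not try to settle.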

As discussed in https://github.com/openzim/warc2zim/issues/123, we would benefit from logging the metadata which are used, at least all text values. Regarding illustration, do we want to log the base64 value? It...

enhancement

When computing the list of tags, it could help to deduplicate them, so that they are not "doubled" by mistake. https://github.com/openzim/ted/blob/60fb82a127b371907c8d24ba70b4e50d29ff5005/src/ted2zim/scraper.py#L93

enhancement
good first issue
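One possible way to deduplicate tags while keeping their original order is shown below. It is a sketch only: `normalize_tags` and its `extra` parameter are hypothetical names, and it assumes the user-provided value is a semicolon-separated string, with automatically added tags passed separately.

```python
from typing import Iterable, List


def normalize_tags(raw: str, extra: Iterable[str] = ()) -> List[str]:
    """Hypothetical helper: split, trim and deduplicate ZIM tags.

    raw   -- user-provided string, values separated by semicolons
    extra -- tags added automatically by the scraper
    """
    candidates = [*raw.split(";"), *extra]
    cleaned = (tag.strip() for tag in candidates)
    # dict.fromkeys keeps the first occurrence of each tag, in order.
    return list(dict.fromkeys(tag for tag in cleaned if tag))


tags = normalize_tags("wikipedia; _category:wiki ;wikipedia;", ["_category:wiki", "nopic"])
```

With this input, `tags` ends up as `["wikipedia", "_category:wiki", "nopic"]`: empty entries and repeats are dropped, and the first occurrence wins.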

https://github.com/openzim/python-scraperlib/blob/7d498319baadba715316c15cf9857ff2f6974a00/README.md?plain=1#L60 1. My Debian install uses the "externally managed" python3 install option, so it seems that I need `pipx` rather than `pip`. Neither was installed and I had to find...

question