python-scraperlib
Collection of Python code to re-use across Python-based scrapers
Currently, we rely on various objects in scraperlib to:

- create the ZIM
- re-encode videos and images
- cache these assets on the optimization cache

We might consider to...
As discussed in https://github.com/openzim/sotoki/pull/162#issuecomment-660452579, it actually seems a bit odd to handle duplicate files in the scrapers. We could instead have a system of redirects so that we keep a single copy of...
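A rough sketch of that idea, deduplicating payloads by content digest and registering later occurrences as redirects to the first stored copy. The `add_redirect()` and `add_item_for()` helper names on the Creator are assumptions here:

```python
import hashlib
from pathlib import Path

seen: dict[str, str] = {}  # content digest -> ZIM path of the stored copy


def add_file_once(creator, path: str, fpath: Path) -> None:
    """Store a file only once; duplicates become redirects to the first copy."""
    digest = hashlib.sha256(fpath.read_bytes()).hexdigest()
    if digest in seen:
        # duplicate payload: only record an alias pointing at the existing entry
        creator.add_redirect(path, seen[digest])
        return
    seen[digest] = path
    creator.add_item_for(path=path, fpath=fpath)
```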
[PEP585](https://peps.python.org/pep-0585/) introduced support for the generics syntax in all standard collections currently available in the `typing` module. At the same time, it deprecated the use of `typing` for all these...
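For illustration, the new spellings next to the deprecated `typing` ones:

```python
from typing import Dict, List  # deprecated spellings since PEP 585


def lengths_old(names: List[str]) -> Dict[str, int]:
    return {name: len(name) for name in names}


def lengths_new(names: list[str]) -> dict[str, int]:  # built-in generics, Python 3.9+
    return {name: len(name) for name in names}
```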
It offers better compression
Testing WebP support on youtube showed that `WebpHigh` doesn't produce *high*-quality thumbnails. As these image presets are going to be used everywhere, it's important that, now that the rest...
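To get a feel for which quality setting `WebpHigh` should map to, a quick comparison with Pillow (the file name and quality values are just examples):

```python
from pathlib import Path

from PIL import Image  # Pillow

src = Path("thumbnail.png")  # hypothetical source thumbnail
image = Image.open(src)
for quality in (50, 60, 80, 90):
    dst = src.with_suffix(f".q{quality}.webp")
    # method=6 asks the WebP encoder for its slowest / best compression effort
    image.save(dst, format="WEBP", quality=quality, method=6)
    print(f"quality={quality}: {dst.stat().st_size} bytes")
```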
As we're seeing SVGs as source images in scraper sources, it'd be good to have an SVG optimizer/cleaner. Probably lossless only at this point.

- [svgo](https://github.com/svg/svgo) probably most popular (node)
- ...
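Pending a decision on the tool, a minimal wrapper shelling out to svgo (which has to be installed separately) could look like the sketch below; a pure-Python option such as scour could be wrapped the same way:

```python
import subprocess
from pathlib import Path


def optimize_svg(src: Path, dst: Path) -> None:
    """Losslessly clean an SVG by delegating to the svgo CLI."""
    subprocess.run(
        ["svgo", "--input", str(src), "--output", str(dst)],
        check=True,  # raise if svgo is missing or fails
    )
```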
We use kiwix_storagelib to implement the S3-based optimization cache in the scrapers. However, this gives rise to redundant code. We put a version of the file along with the optimizer...
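The pattern each scraper currently repeats looks roughly like the sketch below; the method names on the storage client and the `optimizer_version` tag are assumptions here:

```python
from pathlib import Path

OPTIMIZER_VERSION = "v1"  # hypothetical cache-busting tag


def get_optimized(s3_storage, key: str, fpath: Path, optimize) -> None:
    """Reuse an optimized file from the S3 cache, or optimize and upload it."""
    if s3_storage.has_object_matching_meta(
        key, tag="optimizer_version", value=OPTIMIZER_VERSION
    ):
        # cache hit: the stored copy was produced by the same optimizer version
        s3_storage.download_file(key, fpath)
        return
    optimize(fpath)  # cache miss: optimize locally, then populate the cache
    s3_storage.upload_file(fpath, key, meta={"optimizer_version": OPTIMIZER_VERSION})
```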
We have a helper delete_callback at https://github.com/openzim/python-scraperlib/blob/335d5271e106b374f1aca871d19557ff2c81582d/src/zimscraperlib/filesystem.py#L47 This delete_callback is meant to be used as a callback when adding an item to the ZIM, typically to delete the original file once...
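For reference, the typical usage pattern in a scraper looks roughly like this; the Creator import path, its constructor arguments and the `callback=` keyword are assumptions here:

```python
from pathlib import Path

from zimscraperlib.filesystem import delete_callback
from zimscraperlib.zim.creator import Creator  # import path assumed

fpath = Path("video.webm")  # hypothetical temporary asset
with Creator("output.zim", "index.html") as creator:  # arguments assumed
    creator.add_item_for(
        path="videos/video.webm",
        fpath=fpath,
        # invoked once libzim has consumed the item, so the
        # temporary file doesn't linger on disk
        callback=(delete_callback, fpath),
    )
```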
Since 4.0.0, it looks like automatic indexing of PDFs has made the scraperlib significantly slower to process items. It is probably linked to the fact that, with the current 4.0.0 implementation...
The README should mention that libcairo is mandatory for SVG operations.