python-scraperlib

Collection of Python code to re-use across Python-based scrapers

Results 52 python-scraperlib issues

I had to skip this test, which currently fails with the most recent libzim: https://github.com/openzim/python-scraperlib/blob/fef63f81fdb9dd6d2a5e17d9c8785e3fd22665e9/tests/zim/test_indexing.py#L114-L144 This looks like an upstream issue, hopefully only at read time: https://github.com/openzim/libzim/issues/981

upstream
regression

This issue serves as a checklist for the release event. - [ ] Ensure the CI is green on git `main` - [ ] Check that dependency ranges are ok,...

task

Ruff / Flake8 has a new rule `A005`: https://docs.astral.sh/ruff/rules/stdlib-module-shadowing/ It is recommended to not shadow Python standard-library modules. Currently, we have 5 issues: ``` src/zimscraperlib/html.py:1:1: A005 Module `html` shadows a...

enhancement

scraperlib has grown significantly over the years, with modules that are very useful even beyond pure scraper usage. The ZIM wrapper, to name only one, is very useful in itself. We are also...

enhancement

Following #227, scraperlib users are not allowed to send values of unexpected types to our API. We should apply the same strict treatment to our usage of others' APIs. beartype...

bug

Our scrapers all depend on S3 and use [`KiwixStorage`](https://github.com/openzim/python-storagelib) for it. That wrapper and repo are mostly untouched (yet working) and would greatly benefit from being integrated into scraperlib: tests,...

enhancement
question

For files hosted on upload.wikimedia.org, we must comply with their User-Agent policy at https://meta.wikimedia.org/wiki/User-Agent_policy Doing so at scraperlib level in `stream_file` (main method used in many scrapers to download files...

enhancement
question
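As a rough illustration of the User-Agent idea above, the sketch below attaches a descriptive `User-Agent` header only when the target host is upload.wikimedia.org. It uses the standard library's `urllib.request.Request` rather than scraperlib's `stream_file`, whose signature is not shown here; the helper `build_request`, the `WIKIMEDIA_HOSTS` set, and the `USER_AGENT` value are all assumptions for illustration.

```python
from urllib.parse import urlsplit
from urllib.request import Request

# Hosts for which a descriptive User-Agent is set (assumed list)
WIKIMEDIA_HOSTS = {"upload.wikimedia.org"}

# Hypothetical value: a tool name plus contact information
USER_AGENT = "example-scraper/0.1 (https://example.org/contact; contact@example.org)"


def build_request(url: str) -> Request:
    """Build a request object, adding the User-Agent only for Wikimedia hosts."""
    request = Request(url)  # no network access happens here
    if urlsplit(url).hostname in WIKIMEDIA_HOSTS:
        request.add_header("User-Agent", USER_AGENT)
    return request
```

Keeping the host check in one helper means individual scrapers would not need to know about the policy, which is the appeal of handling it at library level.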

This PR enriches scraperlib with a `ScraperExecutor`. This class is capable of processing tasks in parallel, with a given number of worker **threads**. This executor is mainly inspired by...
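For readers unfamiliar with the pattern, here is a minimal sketch of processing tasks in parallel with a fixed number of worker threads. It relies on the standard library's `concurrent.futures.ThreadPoolExecutor` and is not the `ScraperExecutor` from the PR, whose interface is not shown here; the helper `run_in_threads` and its `nb_workers` parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_threads(tasks, nb_workers=2):
    """Run zero-argument callables on worker threads.

    Results are returned in submission order, not completion order.
    """
    with ThreadPoolExecutor(max_workers=nb_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        # result() blocks until the task is done and re-raises its exception, if any
        return [future.result() for future in futures]


results = run_in_threads([lambda n=n: n * n for n in range(5)], nb_workers=3)
print(results)  # [0, 1, 4, 9, 16]
```

Threads suit scraper workloads that spend most of their time waiting on network or disk I/O; CPU-bound tasks would gain little from this approach in CPython.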

https://app.readthedocs.org/projects/python-scraperlib/

bug

Currently we have access to a low-quality option and the high-quality default, but something in the middle feels like it would be better for backup uses, like the...

bug
enhancement