benoit74
While Zimit supports a `--profile` parameter to set the Browsertrix browser profile, this parameter can only be a path to a local file. We should add support for URLs which...
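One possible way to accept either form could look like the sketch below. It is not Zimit's implementation: the helper names `is_remote_profile` and `resolve_profile`, the `profile.tar.gz` target filename, and the choice of `urllib` are all assumptions made for illustration.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def is_remote_profile(profile: str) -> bool:
    """Return True when the value looks like an http(s) URL (hypothetical helper)."""
    return urlparse(profile).scheme in ("http", "https")


def resolve_profile(profile: str, download_dir: Path) -> Path:
    """Return a local path for the profile, downloading it first if it is a URL.

    Hypothetical helper: local paths are returned unchanged, so existing
    `--profile` usage would keep working.
    """
    if not is_remote_profile(profile):
        return Path(profile)
    # Assumed target filename; the real archive name may differ.
    target = download_dir / "profile.tar.gz"
    with urllib.request.urlopen(profile) as response, open(target, "wb") as fh:
        fh.write(response.read())
    return target
```

Checking the scheme explicitly, rather than testing whether the file exists, keeps a mistyped local path from being silently treated as a URL.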
https://github.com/openzim/warc2zim/issues/207 proved that warc2zim tests are not sufficient to ensure we produce a ZIM as expected under all conditions. We could enhance zimit integration tests to also assert the list...
We have multiple instances where a Browsertrix crawl ends up with this kind of error: ``` {"timestamp":"2024-01-15T08:17:30.893Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://solar.lowtechmagazine.com/pl/2020/04/fruit-trenches-cultivating-subtropical-plants-in-freezing-temperatures/","workerid":0}} {"timestamp":"2024-01-15T08:17:30.991Z","logLevel":"warn","context":"general","message":"Link Extraction failed in frame","details":{"reason":{"name":"TargetCloseError"},"page":"https://solar.lowtechmagazine.com/pl/2020/01/how-sustainable-is-a-solar-powered-website/","workerid":0}} {"timestamp":"2024-01-15T08:17:31.478Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n at #onTargetCrashed...
In addition to issue #266, which has often been encountered on the solar.lowtechmagazine.com [recipe](https://farm.openzim.org/recipes/solar.lowtechmagazine.com), it became clear that the situation is even worse with the last ZIM update I tried to perform...
See https://farm.youzim.it/pipeline/588ad4d1-0f4a-4705-8cc6-183d97800cab/debug Error is: ``` failed to connect to https://mr3.pw/female-escorts-in-colombo: Invalid leading whitespace, reserved character(s), or returncharacter(s) in header value: ' Youzim.it+ [email protected]' ``` (i.e. this is a Python error,...
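The error message points at leading whitespace in the header value. A minimal sketch of a guard that could be applied before a value is sent as an HTTP header is shown below. The function name `clean_header_value` is hypothetical and this is not the fix adopted in Zimit; it only illustrates the class of problem the message describes.

```python
def clean_header_value(value: str) -> str:
    """Strip surrounding whitespace and reject CR/LF in a header value.

    Hypothetical helper: the check covers two of the conditions named in
    the error message above (leading whitespace and return characters).
    """
    cleaned = value.strip()
    if "\r" in cleaned or "\n" in cleaned:
        raise ValueError(f"return character(s) in header value: {cleaned!r}")
    return cleaned


# A value with a leading space, as in the failing run, becomes acceptable.
print(clean_header_value(" Youzim.it+ contact"))
```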
Currently, Browsertrix crawler parameters give us fine control over which pages are fetched into the ZIM. However, all resources found on the page are fetched. This could pose issues...
Zimfarm recipe: https://farm.openzim.org/recipes/courses.lumenlearning.com_en_all We have a crawling issue on this recipe: the crawl retrieves only a few pages, and it looks like all courses are missing.
Zimit version: 1.6.2 (not yet released, just to have the fix for `--depth 0` + crawler 0.12.2) While doing a ZIM of https://kiwix.org, the YouTube video on the home page...
A youzim.it run of https://archives.nyphil.org/ failed, reporting lots of unrecognized characters. The task is [here](https://farm.youzim.it/pipeline/3cd41b6b-2d81-4acb-8948-a6820c5fa07f). Command used: ``` zimit --url=https://archives.nyphil.org/ --name=archives.nyphil.org_67aad441 --zim-file=archives.nyphil.org_67aad441.zim --userAgentSuffix=Youzim.it+ --sizeLimit=4294967296 --timeLimit=7200 --output=/output --statsFilename=/output/task_progress.json [email protected] ``` Final error: ```...
This issue serves as a checklist for the release event.
- [ ] ~Check that dependencies have been updated to latest version (especially python-scraper lib)~
- [ ] ~Adjust version...