benoit74
Probably following the upgrade to zimscraperlib 5, it is no longer possible to pass multiple languages as CSV: ``` Traceback (most recent call last): File "/usr/bin/zimit", line 8, in sys.exit(zimit.zimit())...
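For illustration only, here is a minimal sketch of what accepting a CSV language value could look like. `parse_languages` is a hypothetical helper, not a function from zimit or zimscraperlib, and the sketch does not validate the codes themselves.

```python
def parse_languages(value: str) -> list[str]:
    """Split a comma-separated language value into a list of codes.

    Hypothetical helper: surrounding whitespace is stripped and empty
    entries (e.g. from a trailing comma) are dropped. No ISO validation.
    """
    return [code.strip() for code in value.split(",") if code.strip()]


if __name__ == "__main__":
    print(parse_languages("eng,fra"))
```

Validation of each code would still have to happen afterwards, on every list item rather than on the raw CSV string.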
Currently, when the favicon used as illustration is not the proper size, we resize it. However, this does not reduce the file size. We should call `optimize_png` (available in scraperlib) after...
On some occasions, we have recipes for which warc2zim takes a very long time to run. For instance https://farm.openzim.org/pipeline/466196d7-aa93-40cd-aec4-d8fb49294255: - browsertrix crawler started at 2025-03-03 20:52:24 - warc2zim started at 2025-03-14...
In JS rewriting, we have a `replace_this_non_prop` rule which, for instance, transforms: - `a = this;` into `a = _____WB$wombat$check$this$function_____(this)` - `return this.location` into `return _____WB$wombat$check$this$function_____(this).location` and so on. There...
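The kind of transformation described above can be sketched with a single regular expression. This is a simplified illustration under stated assumptions, not the actual `replace_this_non_prop` rule: the names `THIS_NON_PROP` and `wrap_this` are hypothetical, and the pattern makes no attempt to skip string literals or comments.

```python
import re

# Wrapper text taken from the examples above.
WRAPPER = "_____WB$wombat$check$this$function_____(this)"

# Assumption: match `this` only as a standalone identifier, i.e. not
# preceded by an identifier character, `$` or `.` (so `foo.this` is left
# alone), and not followed by an identifier character or `$` (so
# `thisValue` is left alone).
THIS_NON_PROP = re.compile(r"(?<![\w$.])this(?![\w$])")


def wrap_this(js: str) -> str:
    """Wrap every standalone `this` in the wombat check function."""
    # A callable replacement avoids any escape processing of WRAPPER.
    return THIS_NON_PROP.sub(lambda _match: WRAPPER, js)


if __name__ == "__main__":
    print(wrap_this("return this.location"))
```

Only the `this` token itself is replaced, so the surrounding text (a trailing `;` or `.location`) is preserved as-is.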
See https://farm.zimit.kiwix.org/pipeline/e0d6a925-1892-4306-a6cf-b71791d23e42/debug While it is well known that some websites declare an improper encoding, it is weird to have a "None" encoding. To be analyzed. Web page with the problem: https://www.highlandwoodworking.com/finishing/wood-finishing-color-triangle.html
In some cases (e.g. https://github.com/openzim/zim-requests/issues/1162, but I'm pretty sure https://github.com/openzim/warc2zim/issues/402 would need the same), we need to patch website JS so that it does not interfere badly once inside the...
Task: https://farm.zimit.kiwix.org/pipeline/d5d36f11-fdf0-4fa8-a078-99a46b2250aa/debug command: ``` zimit --url=https://istorija.haroldas.net --name=istorija.haroldas.net_d9cf9925 --zim-file=istorija.haroldas.net_d9cf9925.zim --userAgentSuffix=zimit.kiwix.org+ --sizeLimit=4294967296 --timeLimit=7200 --output=/output --statsFilename=/output/task_progress.json [email protected] --keep --publisher=openZIM ``` stdout: ``` [warc2zim::2024-12-29 23:24:49,350] ERROR:Problem encountered while processing https://istorija.haroldas.net/?zip=storage. Traceback (most recent call...
https://farm.zimit.kiwix.org/pipeline/063787bf-02ba-4cee-9d62-4a024f883967/debug ``` ).rewrite(self.content_str) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "", line 32, in rewrite File "/app/zimit/lib/python3.12/site-packages/zimscraperlib/rewriting/html.py", line 165, in rewrite self.close() File "/usr/lib/python3.12/html/parser.py", line 115, in close self.goahead(1) File "/usr/lib/python3.12/html/parser.py", line 179, in goahead...
Task: https://farm.zimit.kiwix.org/pipeline/bb7f1afd-c1b3-4f26-bada-a5ea067cd6d4/debug Crawl was interrupted after 2 hours as expected. Only 390 pages have been crawled. However, I had to manually stop warc2zim because it was still processing after about...