benoit74
See https://github.com/openzim/warc2zim/pull/218#issuecomment-2020609892 for details. Basically we probably need to: - merge the `indexed_urls` and `existing_zim_paths` into a single dictionary `zim_entries_created` where the key is the ZIM path and the value...
`WARCPayloadItem` and `Rewriter`s have a `path` member which is a string. In fact, it should be kept as a `ZimPath` in these classes as well for code clarity.
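To illustrate the idea, here is a minimal sketch of a class holding a typed path rather than a bare string. The `ZimPath` and `PayloadItem` definitions below are hypothetical stand-ins written for this example; they are not the project's actual classes and may differ from them in every detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZimPath:
    """Hypothetical wrapper marking a string as a path inside the ZIM."""

    value: str

    def __post_init__(self) -> None:
        # Reject anything that is not a plain string as early as possible
        if not isinstance(self.value, str):
            raise TypeError(f"ZimPath expects a str, got {type(self.value).__name__}")


@dataclass
class PayloadItem:
    """Hypothetical item whose path is typed as ZimPath instead of str."""

    path: ZimPath
    title: str = ""


item = PayloadItem(path=ZimPath("example.com/index.html"), title="Home")
# The raw string is only unwrapped at the boundary where it is really needed
raw_path = item.path.value
```

Because the wrapper is frozen, it is hashable and can still be used as a dictionary key or set member, so code that tracks already-created paths would not need to fall back to plain strings.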
Scraper must validate metadata and fail as early as possible (especially when called without WARCs to validate inputs).
Zimit2: Document what is known (or supposed) to work and known limitations (or not tested at least)
It is becoming more and more important to document what is known to work and what is not supported (yet or at all). I propose a first...
Task: https://farm.youzim.it/pipeline/242c7e50-dac4-4cfe-bd15-92af8ef003ba/debug Logs: ``` Processing WARC files in /output/.tmpo2d1azgz/collections/crawl-20240226094351919/archive 16 WARC files found Calling warc2zim with these args: ['--name=developer.android.com_095ac3f0', '--zim-file=developer.android.com_095ac3f0.zim', '--publisher=openZIM', '--output', '/output', '--url', 'https://developer.android.com/?gclid=Cj0KCQiAwP6sBhDAARIsAPfK_wafUvQ9ZEyZvgEE17WFwZ3rZAnjF8P-2I7gUW8gbR8iGQezwc2euVsaAh72EALw_wcB&gclsrc=aw.ds', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmpo2d1azgz/collections/crawl-20240226094351919/archive'] [DEBUG]...
https://github.com/openzim/zim-requests/issues/831 is failing with an issue which seems close to #186 but is not exactly identical (JS file, first bytes of content seem fine). ``` Traceback (most recent call...
See https://github.com/openzim/warc2zim/pull/158#discussion_r1467770641 for details.
It is possible to redirect a page with a `meta http-equiv`: ``` ``` See https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#http-equiv So far, Zimit2 does not rewrite these links, which therefore do not lead to content...
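As a rough sketch of what rewriting such a redirect could involve, the function below parses the `content` attribute of a refresh directive (a delay, optionally followed by a URL) and passes the URL through a rewriting callback. The names `rewrite_meta_refresh` and `rewrite_url` are assumptions made for this example, not part of the scraper, and the parsing is deliberately simpler than what browsers accept.

```python
import re
from typing import Callable

# Matches values such as "5; url=https://example.com/" or "0;URL='page.html'"
_REFRESH_RE = re.compile(
    r"^\s*(?P<delay>\d+(?:\.\d+)?)\s*[;,]\s*(?:url\s*=\s*)?"
    r"(?P<quote>['\"]?)(?P<url>.*?)(?P=quote)\s*$",
    re.IGNORECASE,
)


def rewrite_meta_refresh(content: str, rewrite_url: Callable[[str], str]) -> str:
    """Return the content attribute with its target URL rewritten.

    The value is returned unchanged when it carries no URL (plain reload)
    or when it does not look like a refresh directive at all.
    """
    match = _REFRESH_RE.match(content)
    if not match or not match.group("url"):
        return content
    return f"{match.group('delay')};url={rewrite_url(match.group('url'))}"


def to_relative(url: str) -> str:
    """Toy rewriter for the example: drop the scheme and go up one level."""
    return "../" + url.split("://", 1)[-1]


rewritten = rewrite_meta_refresh("0; url=https://example.com/page", to_relative)
```

In a real rewriter, the callback would be whatever function already rewrites `href` and `src` attributes, so that redirect targets follow the same rules as ordinary links.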
The scraper has fuzzy rules coming from wabac.js fuzzy matcher: https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js These rules have been updated at the source but changes have not been back-ported. One significant thing I noticed...
It looks like Zimcheck is complaining about quality issues in most (all?) Zimit2 files. It already did so for Zimit1, but maybe it is time to address the problems. The...