benoit74

Results 370 issues of benoit74

See https://github.com/openzim/warc2zim/pull/218#issuecomment-2020609892 for details. Basically we probably need to: - merge the `indexed_urls` and `existing_zim_paths` into a single dictionary `zim_entries_created` where the key is the ZIM path and the value...

bug
question

`WARCPayloadItem` and `Rewriter`s have a `path` member which is a string. In fact, it should be kept as a `ZimPath` in these classes as well for code clarity.

enhancement
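
To illustrate the idea of keeping the path typed, here is a minimal sketch. The `ZimPath` below is a hypothetical stand-in (a frozen dataclass wrapping a string), not warc2zim's actual class, and `PayloadItem` with its `get_path` method is an invented, simplified item. The conversion to a plain string happens only at the boundary where one is required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZimPath:
    """Hypothetical stand-in: a path already normalized for use inside the ZIM."""

    value: str


@dataclass
class PayloadItem:
    """Hypothetical, simplified item that keeps its path typed, not as a bare str."""

    path: ZimPath
    title: str = ""

    def get_path(self) -> str:
        # Convert to a plain string only where a string is actually required.
        return self.path.value


item = PayloadItem(path=ZimPath("example.com/index.html"), title="Home")
```

With this shape, a type checker can flag code that passes a raw URL string where a normalized ZIM path is expected.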

Scraper must validate metadata and fail as early as possible (especially when called without WARCs to validate inputs).

enhancement
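
A fail-early check could look roughly like the sketch below. The function name `validate_metadata`, the list of required keys, and the length limits are assumptions made for illustration; the real rules should come from the openZIM metadata conventions. All problems are collected and reported at once, before any WARC is read.

```python
# Assumed subset of required keys, for illustration only.
REQUIRED_KEYS = ("Name", "Title", "Description", "Language")
# Assumed length limits; verify against the openZIM metadata conventions.
MAX_LENGTHS = {"Title": 30, "Description": 80}


def validate_metadata(metadata: dict[str, str]) -> None:
    """Raise ValueError listing every problem found in the metadata."""
    problems = []
    for key in REQUIRED_KEYS:
        if not metadata.get(key, "").strip():
            problems.append(f"{key} is missing or empty")
    for key, limit in MAX_LENGTHS.items():
        value = metadata.get(key, "")
        if len(value) > limit:
            problems.append(f"{key} is too long ({len(value)} > {limit})")
    if problems:
        raise ValueError("; ".join(problems))
```

Calling such a function right after argument parsing would let the scraper stop before doing any expensive work, including when it is run without WARCs just to validate inputs.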

It is becoming more and more important to document what is known to work and what is known not to be supported (yet or at all). I propose a first...

documentation

Task: https://farm.youzim.it/pipeline/242c7e50-dac4-4cfe-bd15-92af8ef003ba/debug Logs: ``` Processing WARC files in /output/.tmpo2d1azgz/collections/crawl-20240226094351919/archive 16 WARC files found Calling warc2zim with these args: ['--name=developer.android.com_095ac3f0', '--zim-file=developer.android.com_095ac3f0.zim', '--publisher=openZIM', '--output', '/output', '--url', 'https://developer.android.com/?gclid=Cj0KCQiAwP6sBhDAARIsAPfK_wafUvQ9ZEyZvgEE17WFwZ3rZAnjF8P-2I7gUW8gbR8iGQezwc2euVsaAh72EALw_wcB&gclsrc=aw.ds', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmpo2d1azgz/collections/crawl-20240226094351919/archive'] [DEBUG]...

bug

https://github.com/openzim/zim-requests/issues/831 is failing with an issue that seems close to #186 but is not exactly identical (JS file, first bytes of content seem pretty OK). ``` Traceback (most recent call...

bug
recipe

See https://github.com/openzim/warc2zim/pull/158#discussion_r1467770641 for details.

question

It is possible to redirect a page with a `meta http-equiv`: ``` ``` See https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#http-equiv So far, Zimit2 does not rewrite these links which are hence not leading to content...

enhancement
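
As a rough sketch of what such a rewrite involves: a refresh redirect carries its target inside the `content` attribute of a `<meta http-equiv="refresh">` tag, typically as `delay; url=target`. The helper below is hypothetical (it is not Zimit's or warc2zim's code) and only handles that common form; `rewrite_url` stands for whatever URL rewriting function the scraper already applies to ordinary links.

```python
import re
from typing import Callable

# Matches "<delay>; url=<target>", with optional quotes around the target.
META_REFRESH = re.compile(
    r"^(?P<delay>\s*\d+\s*;\s*url\s*=\s*)(?P<quote>['\"]?)(?P<url>.+?)(?P=quote)\s*$",
    re.IGNORECASE,
)


def rewrite_meta_refresh(content: str, rewrite_url: Callable[[str], str]) -> str:
    """Rewrite the target URL inside a meta refresh `content` value."""
    match = META_REFRESH.match(content)
    if match is None:
        # For example "30": a plain reload without a URL, nothing to rewrite.
        return content
    new_url = rewrite_url(match["url"])
    return f"{match['delay']}{match['quote']}{new_url}{match['quote']}"
```

The delay prefix and quoting style are preserved as found, so only the target changes.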

The scraper has fuzzy rules coming from the wabac.js fuzzy matcher: https://github.com/webrecorder/wabac.js/blob/main/src/fuzzymatcher.js These rules have been updated at the source, but the changes have not been back-ported. One significant thing I noticed...

enhancement

It looks like Zimcheck is complaining about quality issues in most (all?) Zimit2 files. It already did so for Zimit1, but maybe it is time to address the problems. The...

bug
enhancement
question