benoit74
See https://github.com/webrecorder/browsertrix-crawler/issues/630
See https://github.com/openzim/zim-requests/issues/1059
In https://github.com/openzim/warc2zim/pull/306, we changed the way we detect the type of content and introduced a warning intended to help diagnose potential issues with this significant change. Once this has...
Currently, non-GET (POST, PUT, ...) requests returning an HTML document are supposed to work, but they are not tested at all. They are supposed to work based on what has...
Currently, JSONP support is not tested at all. It is supposed to work based on what has been transferred from wabac.js. We need to: - create...
Do we want to raise a warning in the logs (or fail the scraper?) when we have two WARC records leading to the same ZIM Path, most probably due to...
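To make the two options in this question concrete, here is a minimal sketch of detecting URLs that collapse to the same path, with a switch between warning and failing. It is not warc2zim's code: `normalize_to_path`, `find_path_collisions` and the `strict` flag are hypothetical names, and the normalization shown (dropping the scheme and the fragment) is only an assumed stand-in for the real URL-to-ZIM-path logic.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("duplicate-paths")


def normalize_to_path(url: str) -> str:
    """Hypothetical stand-in for URL-to-ZIM-path normalization.

    Assumption: only the scheme and the fragment are dropped here.
    """
    without_scheme = url.split("://", 1)[-1]
    return without_scheme.split("#", 1)[0]


def find_path_collisions(urls, strict=False):
    """Group URLs by normalized path and report paths reached more than once."""
    by_path = defaultdict(list)
    for url in urls:
        by_path[normalize_to_path(url)].append(url)

    collisions = {path: srcs for path, srcs in by_path.items() if len(srcs) > 1}
    for path, sources in collisions.items():
        message = f"{len(sources)} records map to the same path {path!r}: {sources}"
        if strict:
            # "Fail the scraper" option
            raise ValueError(message)
        # "Warn in the logs" option
        logger.warning(message)
    return collisions
```

Calling `find_path_collisions(urls)` returns the colliding paths and logs one warning per collision; passing `strict=True` raises on the first collision instead.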
Currently, warc2zim is very permissive regarding issues that may arise while rewriting documents. This is mostly mandatory due to - the nature of websites encountered in the wild which are...
Fix #370 Changes: - use a generic class to automatically compute function signature at rule initialization - use cached value at "runtime"
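The general idea behind these two changes can be sketched as follows. This is an illustration under assumptions, not the code from the PR: `CachedSignatureRule`, `drop_query` and `tag_with_base` are hypothetical names. The callable's signature is inspected once when the rule is created, and each later call only consults the cached set of parameter names.

```python
import inspect
from typing import Any, Callable


class CachedSignatureRule:
    """Hypothetical rule wrapper that inspects its callable once, at initialization."""

    def __init__(self, func: Callable[..., str]) -> None:
        self.func = func
        # Introspection happens a single time here, not on every call.
        self.accepted = frozenset(inspect.signature(func).parameters)

    def __call__(self, **available: Any) -> str:
        # "Runtime" path: filter the offered arguments using the cached names.
        kwargs = {k: v for k, v in available.items() if k in self.accepted}
        return self.func(**kwargs)


def drop_query(url: str) -> str:
    """Example rule that only needs the URL."""
    return url.split("?", 1)[0]


def tag_with_base(url: str, base_href: str) -> str:
    """Example rule that also needs the base href."""
    return f"{base_href}|{url}"


rule_a = CachedSignatureRule(drop_query)
rule_b = CachedSignatureRule(tag_with_base)
```

Both rules can then be called with the same set of keyword arguments, for example `rule_a(url=..., base_href=...)`, and each receives only the parameters it declares.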
For a very small WARC like https://github.com/openzim/warc2zim/blob/main/tests/data-special/qsl.net-encoding-alias.warc.gz, it takes more than 2 minutes to build the ZIM. A flamegraph shows that most of the time is spent in the `rewrite_html`...
See https://data.fs.usda.gov/geodata/rastergateway/states-regions/states.php

Seen on https://farm.zimit.kiwix.org/pipeline/f1a1f927-a785-4f8f-b0c6-c7d69e75ed14/debug

```
Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line 585, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.12/site-packages/zimit/zimit.py", line...
```