warc2zim
Unable to find WARC record for main page
Task: https://farm.youzim.it/pipeline/242c7e50-dac4-4cfe-bd15-92af8ef003ba/debug
Logs:
Processing WARC files in /output/.tmpo2d1azgz/collections/crawl-20240226094351919/archive
16 WARC files found
Calling warc2zim with these args: ['--name=developer.android.com_095ac3f0', '--zim-file=developer.android.com_095ac3f0.zim', '--publisher=openZIM', '--output', '/output', '--url', 'https://developer.android.com/?gclid=Cj0KCQiAwP6sBhDAARIsAPfK_wafUvQ9ZEyZvgEE17WFwZ3rZAnjF8P-2I7gUW8gbR8iGQezwc2euVsaAh72EALw_wcB&gclsrc=aw.ds', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmpo2d1azgz/collections/crawl-20240226094351919/archive']
[DEBUG] Confirming output is writable using /output/tmpdlm1wput
Traceback (most recent call last):
  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 464, in zimit
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 113, in main
    return converter.run()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 231, in run
    self.find_main_page_metadata()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 371, in find_main_page_metadata
    raise KeyError(
KeyError: 'Unable to find WARC record for main page: https://developer.android.com/?gclid=Cj0KCQiAwP6sBhDAARIsAPfK_wafUvQ9ZEyZvgEE17WFwZ3rZAnjF8P-2I7gUW8gbR8iGQezwc2euVsaAh72EALw_wcB&gclsrc=aw.ds, aborting'
Is it linked to the query parameter?
Again with https://edisciplinas.usp.br/pluginfile.php/4557662/mod_resource/content/1/CRC%20Handbook%20of%20Chemistry%20and%20Physics%2095th%20Edition.pdf as URL (a single-page ZIM with a single PDF file ... not really a promising thing, but why not ...)
Another occurrence with https://www.playmobil.com/fr-fr/tiny-house/71509.html?gad_source=1&gclid=CjwKCAjwuJ2xBhA3EiwAMVjkVK41oNKfKsuOcp6oXd4I1lLYXhgnB4PE3Yg8zSBMPb7jHvZEZbMdBRoCizIQAvD_BwE
Again with https://www.v2ph.net/album/z7nx385a.html?hl=en (but you should probably not go visit this website, you've been warned)
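One way to test the query-parameter hypothesis raised above is to compare the requested main page URL against the URLs recorded in the WARC after dropping tracking parameters. The following is a minimal sketch only: `TRACKING_PARAMS`, `strip_tracking_params` and `find_main_page_candidates` are hypothetical names, the parameter list is an assumption, and none of this reflects how warc2zim actually looks up the main page. The list of recorded URLs is assumed to have been extracted from the WARC files beforehand.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters treated as "tracking only"; this list is illustrative,
# not something warc2zim or zimit defines.
TRACKING_PARAMS = {"gclid", "gclsrc", "gad_source"}


def strip_tracking_params(url: str) -> str:
    """Return the URL with the assumed tracking parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def find_main_page_candidates(main_url: str, recorded_urls: list[str]) -> list[str]:
    """List recorded URLs that match the main URL once tracking parameters are ignored."""
    target = strip_tracking_params(main_url)
    return [url for url in recorded_urls if strip_tracking_params(url) == target]
```

If this returns a match while the exact-URL lookup fails, the query string is a likely culprit; if it returns nothing (as would be expected for the PDF case, which has no query string), the cause lies elsewhere.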
See also the cases in https://github.com/openzim/zimit/issues/328, which confirm the issue is still present in warc2zim 2:
- https://farm.zimit.kiwix.org/pipeline/25f26c89-7f92-4f6e-a53b-35ff2c854912/debug : tracked in https://github.com/openzim/warc2zim/issues/336
- https://farm.zimit.kiwix.org/pipeline/bfa62cd6-36c1-445a-b110-5068e8566d96/debug : no repro, on my machine the crawler is fetching far more pages than the single page fetched in this run
- https://farm.zimit.kiwix.org/pipeline/97caa814-1a42-4a7c-b5f7-6db96ab845c9/debug and https://farm.zimit.kiwix.org/pipeline/6c8787cf-e6d4-475a-8631-552cb3bc6d10/debug and https://farm.zimit.kiwix.org/pipeline/8826288a-374a-4a38-b1fa-b8a8df3c963f/debug and https://www.v2ph.net/album/z7nx385a.html?hl=en : https://github.com/openzim/warc2zim/issues/337
- https://farm.zimit.kiwix.org/pipeline/962a9881-54ab-40f9-af8f-527071857309/debug : no repro, on my machine the crawler is fetching exactly 218 pages just like on the original crawl, but a ZIM is produced successfully
- single PDF as main entry: no repro, on my machine this scenario works and the resulting ZIM works well
- playmobil website: https://github.com/openzim/zimit/issues/331
Other tasks are unfortunately already gone.
Closing this task since subtasks have been properly identified.