Unable to find WARC record for main page

Open benoit74 opened this issue 11 months ago • 1 comments

Task: https://farm.youzim.it/pipeline/242c7e50-dac4-4cfe-bd15-92af8ef003ba/debug

Logs:

Processing WARC files in /output/.tmpo2d1azgz/collections/crawl-20240226094351919/archive
16 WARC files found
Calling warc2zim with these args: ['--name=developer.android.com_095ac3f0', '--zim-file=developer.android.com_095ac3f0.zim', '--publisher=openZIM', '--output', '/output', '--url', 'https://developer.android.com/?gclid=Cj0KCQiAwP6sBhDAARIsAPfK_wafUvQ9ZEyZvgEE17WFwZ3rZAnjF8P-2I7gUW8gbR8iGQezwc2euVsaAh72EALw_wcB&gclsrc=aw.ds', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmpo2d1azgz/collections/crawl-20240226094351919/archive']
[DEBUG] Confirming output is writable using /output/tmpdlm1wput
Traceback (most recent call last):
  File "/usr/bin/zimit", line 566, in <module>
    zimit()
  File "/usr/bin/zimit", line 464, in zimit
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/main.py", line 113, in main
    return converter.run()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 231, in run
    self.find_main_page_metadata()
  File "/app/zimit/lib/python3.10/site-packages/warc2zim/converter.py", line 371, in find_main_page_metadata
    raise KeyError(
KeyError: 'Unable to find WARC record for main page: https://developer.android.com/?gclid=Cj0KCQiAwP6sBhDAARIsAPfK_wafUvQ9ZEyZvgEE17WFwZ3rZAnjF8P-2I7gUW8gbR8iGQezwc2euVsaAh72EALw_wcB&gclsrc=aw.ds, aborting'

Is this linked to the query parameters in the URL?
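One way to test the query-parameter hypothesis is to compare the requested main URL against the URLs actually present in the WARC records, once exactly and once with tracking parameters removed. The sketch below is only a diagnostic aid, not how warc2zim resolves the main page: `TRACKING_PARAMS`, `strip_tracking` and `find_main_page_candidates` are hypothetical names, and `record_urls` is assumed to be a list of target URIs already extracted from the WARC files by some other means.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of tracking parameters, taken from the URLs seen in this issue.
TRACKING_PARAMS = {"gclid", "gclsrc", "gad_source"}


def strip_tracking(url: str) -> str:
    """Return the URL without tracking query parameters and without fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    # urlencode may re-encode values, which is fine here because both sides
    # of the comparison go through the same normalization.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def find_main_page_candidates(main_url: str, record_urls: list[str]):
    """Split record URLs into exact matches and matches that differ only by
    tracking parameters or fragment."""
    exact = [url for url in record_urls if url == main_url]
    target = strip_tracking(main_url)
    near = [
        url
        for url in record_urls
        if url != main_url and strip_tracking(url) == target
    ]
    return exact, near
```

If `exact` is empty but `near` is not, the crawl probably recorded the page under a URL that differs from the requested one only by its query string, which would support the hypothesis. If both are empty, the cause is likely elsewhere (for example a redirect to another host, or a crawl that did not fetch the page at all).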

benoit74 avatar Mar 04 '24 14:03 benoit74

Again with https://edisciplinas.usp.br/pluginfile.php/4557662/mod_resource/content/1/CRC%20Handbook%20of%20Chemistry%20and%20Physics%2095th%20Edition.pdf as URL (a single-page ZIM with a single PDF file ... not really a promising thing, but why not ...)

benoit74 avatar Apr 04 '24 14:04 benoit74

Another occurrence with https://www.playmobil.com/fr-fr/tiny-house/71509.html?gad_source=1&gclid=CjwKCAjwuJ2xBhA3EiwAMVjkVK41oNKfKsuOcp6oXd4I1lLYXhgnB4PE3Yg8zSBMPb7jHvZEZbMdBRoCizIQAvD_BwE

benoit74 avatar Jun 03 '24 12:06 benoit74

Again with https://www.v2ph.net/album/z7nx385a.html?hl=en (but you should probably not go visit this website, you've been warned)

benoit74 avatar Jun 03 '24 13:06 benoit74

See also the cases in https://github.com/openzim/zimit/issues/328, which confirm the issue is still present in warc2zim 2

benoit74 avatar Jun 24 '24 11:06 benoit74

  • https://farm.zimit.kiwix.org/pipeline/25f26c89-7f92-4f6e-a53b-35ff2c854912/debug : tracked in https://github.com/openzim/warc2zim/issues/336
  • https://farm.zimit.kiwix.org/pipeline/bfa62cd6-36c1-445a-b110-5068e8566d96/debug : no repro, on my machine the crawler is fetching way more pages than the single page fetched in this run
  • https://farm.zimit.kiwix.org/pipeline/97caa814-1a42-4a7c-b5f7-6db96ab845c9/debug and https://farm.zimit.kiwix.org/pipeline/6c8787cf-e6d4-475a-8631-552cb3bc6d10/debug and https://farm.zimit.kiwix.org/pipeline/8826288a-374a-4a38-b1fa-b8a8df3c963f/debug and https://www.v2ph.net/album/z7nx385a.html?hl=en : https://github.com/openzim/warc2zim/issues/337
  • https://farm.zimit.kiwix.org/pipeline/962a9881-54ab-40f9-af8f-527071857309/debug : no repro, on my machine the crawler fetches exactly 218 pages, just like the original crawl, but a ZIM is produced successfully
  • single PDF as main entry: no repro, on my machine this scenario works and the resulting ZIM works well
  • playmobil website: https://github.com/openzim/zimit/issues/331

Other tasks are unfortunately already gone.

Closing this task since subtasks have been properly identified.

benoit74 avatar Jun 25 '24 14:06 benoit74