warc2zim
Failure of `iranwire.com`: impossible to decode content
https://github.com/openzim/zim-requests/issues/831 fails with an error that looks close to #186 but is not identical: the failing item is a JS file, and the first bytes of its content look fine.
Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/zimit/zimit.py", line 611, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.11/site-packages/zimit/zimit.py", line 509, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/main.py", line 90, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 264, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 493, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/items.py", line 38, in __init__
    (self.title, self.content) = Rewriter(path, record, known_urls).rewrite(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/generic.py", line 50, in rewrite
    return self.rewrite_js(opts)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/generic.py", line 91, in rewrite_js
    rewriter.rewrite(to_string(self.content, self.encoding), opts),
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/utils.py", line 111, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'!function(u){function e(e){for(var t,n,r=e[0],i=e[1],o=e[2],a=0,s=[];a<r.length;a++)n=r[a],Object.prototype.hasOwnProperty.call(c,n)&&c[n]&&s.push(c[n][0]),c[n]=0;for(t in i)Object.prototype.hasOwnPro'
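The error message only prints the first 200 bytes, so a payload can look fine in the message while still containing an undecodable byte further in. Here is a minimal sketch of how that can happen. `to_string_sketch` is a hypothetical stand-in, not warc2zim's actual `to_string`, and the list of candidate encodings is an assumption made for illustration.

```python
from typing import Optional


def to_string_sketch(input_: bytes, encoding: Optional[str]) -> str:
    """Hypothetical stand-in for a to_string-style helper; NOT warc2zim's code."""
    # Assumption: try the declared encoding first, then fall back to UTF-8.
    candidates = [enc for enc in (encoding, "utf-8") if enc]
    for candidate in candidates:
        try:
            return input_.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers unknown encoding names (e.g. a bogus charset).
            continue
    # Only the first 200 bytes are reported, as in the traceback above.
    raise ValueError(f"Impossible to decode content {input_[:200]}")


# Payload whose first 200 bytes are clean ASCII, with one invalid byte later on.
payload = b"!function(u){" + b"a" * 300 + b"\xff" + b"}"
try:
    to_string_sketch(payload, "utf-8")
except ValueError as exc:
    message = str(exc)
```

In this sketch the offending `\xff` byte sits past offset 200, so it never appears in `message`, which would match the "first bytes look fine" observation.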
The issue is still present on https://farm.openzim.org/pipeline/fb2c30d2-6eb3-4e9a-a09c-843e3b44bd57/debug, and a WARC file was produced, so I will be able to debug!
The issue is also present for BBC Persian (which did not really complete because the disk utilization threshold was reached ...) on https://farm.openzim.org/pipeline/214350a9-8e1b-40f8-aca2-e5fee2727661/debug, and a WARC file was produced!
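When debugging from a WARC payload, it may help to locate the first byte that fails to decode rather than rely on the truncated error message. The snippet below is only a sketch using the standard library. The helper names `first_undecodable_offset` and `context_around` are hypothetical, and it assumes the payload bytes have already been extracted from the record.

```python
from typing import Optional


def first_undecodable_offset(data: bytes, encoding: str = "utf-8") -> Optional[int]:
    """Return the offset of the first byte that fails to decode, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # UnicodeDecodeError.start is the index of the first offending byte.
        return exc.start
    return None


def context_around(data: bytes, offset: int, width: int = 20) -> bytes:
    """Return the bytes surrounding `offset`, to inspect what broke decoding."""
    return data[max(0, offset - width): offset + width]


# Synthetic payload (an assumption for illustration): valid ASCII, then a stray byte.
payload = b"a" * 250 + b"\xff" + b"b" * 10
offset = first_undecodable_offset(payload)
snippet = context_around(payload, offset, width=2) if offset is not None else b""
```

Running this on the real payload with the encoding declared by the record would show whether the failure comes from the content itself or from the declared charset.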
Should we simply restart the recipe to get the WARC and move forward?
Did you read my last comments? We have the WARC files for both iranwire.com and bbc persian which are affected by the issue. Just need time to work on this.
My remark mostly concerns the BBC one, where the WARC file is incomplete. Let's try to have full WARC files to avoid delays later. That was the idea behind my remark.
Oh, right. I don't mind restarting the BBC recipe; done.
Not reproduced with the current warc2zim codebase and the same WARC file. I don't know where this originated. Closing for now.