warc2zim

Failure of `iranwire.com`: impossible to decode content

Open benoit74 opened this issue 1 year ago • 6 comments

https://github.com/openzim/zim-requests/issues/831 is failing with an error that seems close to #186 but is not exactly identical (the content is a JS file, and its first bytes seem fine).

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/zimit/zimit.py", line 611, in zimit
    run(sys.argv[1:])
  File "/app/zimit/lib/python3.11/site-packages/zimit/zimit.py", line 509, in run
    return warc2zim(warc2zim_args)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/main.py", line 90, in main
    return converter.run()
           ^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 264, in run
    self.add_items_for_warc_record(record)
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/converter.py", line 493, in add_items_for_warc_record
    payload_item = WARCPayloadItem(
                   ^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/items.py", line 38, in __init__
    (self.title, self.content) = Rewriter(path, record, known_urls).rewrite(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/generic.py", line 50, in rewrite
    return self.rewrite_js(opts)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/content_rewriting/generic.py", line 91, in rewrite_js
    rewriter.rewrite(to_string(self.content, self.encoding), opts),
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.11/site-packages/warc2zim/utils.py", line 111, in to_string
    raise ValueError(f"Impossible to decode content {input_[:200]}")
ValueError: Impossible to decode content b'!function(u){function e(e){for(var t,n,r=e[0],i=e[1],o=e[2],a=0,s=[];a<r.length;a++)n=r[a],Object.prototype.hasOwnProperty.call(c,n)&&c[n]&&s.push(c[n][0]),c[n]=0;for(t in i)Object.prototype.hasOwnPro'
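
The traceback ends in `to_string` (`warc2zim/utils.py`), which raises once the payload cannot be decoded. As a rough illustration of that failure mode, here is a minimal sketch of a strict decode-with-fallbacks helper. It is not warc2zim's actual implementation: the name `decode_with_fallbacks`, its parameters, and the fallback list are hypothetical, chosen only to show how a payload whose first 200 bytes look like plain ASCII JavaScript can still fail if a later byte is invalid in every candidate encoding.

```python
from typing import Iterable, Optional


def decode_with_fallbacks(
    payload: bytes,
    declared_encoding: Optional[str] = None,
    fallbacks: Iterable[str] = ("utf-8", "cp1252"),  # assumption: illustrative list only
) -> str:
    """Hypothetical helper: try the declared charset first, then each fallback."""
    candidates = [declared_encoding] if declared_encoding else []
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            # Strict decoding: a single invalid byte anywhere rejects the encoding.
            return payload.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # LookupError: unknown codec name.
            # UnicodeDecodeError: the bytes are not valid in this codec.
            continue
    # Same message shape as in the traceback: only the first 200 bytes are shown,
    # so the offending byte may be far beyond what the error message displays.
    raise ValueError(f"Impossible to decode content {payload[:200]!r}")


# A JS-like payload that is not valid UTF-8 but decodes as cp1252.
js_source = "!function(u){/* «note» */}"
js_bytes = js_source.encode("cp1252")
decoded = decode_with_fallbacks(js_bytes, declared_encoding="utf-8")
```

Because the error message truncates the payload, an ASCII-looking prefix says little about whether the whole body is decodable; inspecting the exception position of each `UnicodeDecodeError` would be a natural next debugging step.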

benoit74 • Feb 19 '24 10:02

The issue is still present on https://farm.openzim.org/pipeline/fb2c30d2-6eb3-4e9a-a09c-843e3b44bd57/debug, and a WARC file was produced, so I will be able to debug!

benoit74 • Mar 07 '24 09:03

The issue is also present for BBC Persian (the run did not really complete because the disk utilization threshold was reached ...) on https://farm.openzim.org/pipeline/214350a9-8e1b-40f8-aca2-e5fee2727661/debug, and a WARC file was produced!

benoit74 • Mar 11 '24 07:03

Should we simply restart the recipe to have the WARC and move forward?

kelson42 • Mar 19 '24 07:03

Did you read my last comments? We have the WARC files for both iranwire.com and bbc persian which are affected by the issue. Just need time to work on this.

benoit74 • Mar 19 '24 07:03

> Did you read my last comments? We have the WARC files for both iranwire.com and bbc persian which are affected by the issue. Just need time to work on this.

My remark mostly concerns the BBC one, where the WARC file is incomplete. Let's try to have full WARC files to avoid delays later. That was the idea behind my remark.

kelson42 • Mar 19 '24 07:03

Oh, right. I don't mind restarting the BBC recipe; done.

benoit74 • Mar 19 '24 07:03

Not reproduced with the current warc2zim codebase and the same WARC file. I don't know where this originated. Closing for now.

benoit74 • May 16 '24 15:05