Zeno
XML extractor gets triggered on HTML page
URL.GetMIMEType() in IsXML appears to return text/xml; charset=utf-8 for http://laborculture.org. This is incorrect: the response headers declare text/html and the content is an HTML page.
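One possible guard, sketched under the assumption that the server's Content-Type header is available where IsXML runs: consult the declared media type before falling back to a sniffed one. The helper name isXMLContentType is hypothetical and is not part of Zeno.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// isXMLContentType reports whether a Content-Type header value names an XML
// media type. Hypothetical helper for illustration, not Zeno's actual code.
func isXMLContentType(header string) bool {
	mediaType, _, err := mime.ParseMediaType(header)
	if err != nil {
		// A missing or malformed header gives no evidence of XML.
		return false
	}
	// Note: the "+xml" suffix also matches application/xhtml+xml; whether
	// XHTML should go to the XML or the HTML extractor is left open here.
	return mediaType == "text/xml" ||
		mediaType == "application/xml" ||
		strings.HasSuffix(mediaType, "+xml")
}

func main() {
	fmt.Println(isXMLContentType("text/html"))               // false
	fmt.Println(isXMLContentType("text/xml; charset=utf-8")) // true
	fmt.Println(isXMLContentType("application/rss+xml"))     // true
}
```

With a check like this, the text/html header from http://laborculture.org would keep the page away from the XML extractor regardless of what content sniffing reports.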
time=2025-05-21T18:39:48.614-04:00 level=INFO msg="url archived" worker_id=0 component=archiver.archive url=http://laborculture.org/ seed_id=03b36 item_id=03b36 depth=0 hops=0 status=200
time=2025-05-21T18:39:48.614-04:00 level=ERROR msg="unable to extract assets" component=postprocessor.extractAssets err="xml: encoding \"ISO-8859-1\" declared but Decoder.CharsetReader is nil" item=03b36
time=2025-05-21T18:39:48.614-04:00 level=ERROR msg="unable to extract assets" component=postprocessor.postprocess.postprocessItem err="xml: encoding \"ISO-8859-1\" declared but Decoder.CharsetReader is nil" item_id=03b36
The http://laborculture.org server is behaving oddly: sometimes the response appears to be invalid gzip data, and sometimes it is fine. Further investigation is needed.
yzqzss@yzqzss-WU14 ~ [61]> curl http://laborculture.org/ --compressed
curl: (61) Error while processing content unencoding: incorrect header check
yzqzss@yzqzss-WU14 ~ [61]> curl http://laborculture.org/ --compressed --head
HTTP/1.1 200 OK
Cache-Control: private
Content-Length: 6830
Content-Type: text/html
Content-Encoding: gzip
Last-Modified: Wed, 09 Jun 2021 20:59:03 GMT
Accept-Ranges: bytes
ETag: "80c5647725dd71:0"
Vary: Accept-Encoding
Server: Microsoft-IIS/10.0, IIS111P
X-Powered-By: ASP.NET
Pool: 111
Date: Thu, 22 May 2025 08:32:14 GMT
yzqzss@yzqzss-WU14 ~> curl http://laborculture.org/ --compressed --raw --silent | xxd -
00000000: 1fef bfbd 0800 0000 0000 0400 efbf bd5b ...............[
00000010: 6b73 dbb6 efbf bdef bfbd 6cef bfbd 0aef ks........l.....
00000020: bfbd efbf bd6b 4bef bfbd 444a efbf bd25 .....kK...DJ...%
00000030: 7124 efbf bd4d efbf bdef bfbd d49d efbf q$...M..........
00000040: bdef bfbd efbf bd32 3967 2cef bfbd 03ef .......29g,.....
... skip ...
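A note on reading the dump: a gzip stream starts with the bytes 1f 8b, but here the second byte is ef bf bd, which is the UTF-8 encoding of U+FFFD (the replacement character). That pattern suggests, without confirming it, that the compressed bytes were treated as text and re-encoded somewhere before reaching the client. A minimal sketch of that check, with made-up helper names, using the first 16 bytes of the dump above:

```go
package main

import (
	"bytes"
	"fmt"
)

var (
	gzipMagic       = []byte{0x1f, 0x8b}       // first two bytes of any gzip stream
	replacementChar = []byte{0xef, 0xbf, 0xbd} // UTF-8 encoding of U+FFFD
)

// hasGzipMagic reports whether b starts with the gzip magic number.
func hasGzipMagic(b []byte) bool {
	return bytes.HasPrefix(b, gzipMagic)
}

// countReplacementChars counts UTF-8 encoded U+FFFD sequences in b.
func countReplacementChars(b []byte) int {
	return bytes.Count(b, replacementChar)
}

func main() {
	// First line of the hex dump from the issue.
	head := []byte{
		0x1f, 0xef, 0xbf, 0xbd, 0x08, 0x00, 0x00, 0x00,
		0x00, 0x00, 0x04, 0x00, 0xef, 0xbf, 0xbd, 0x5b,
	}
	fmt.Println(hasGzipMagic(head))          // false
	fmt.Println(countReplacementChars(head)) // 2
}
```

A body that is labelled Content-Encoding: gzip, fails the magic check, and contains many U+FFFD sequences cannot be recovered by decompression, because the original byte values have been lost.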
Can you still open it in a browser?
I think we might be able to archive this site by disabling the Accept-Encoding: gzip request header.
https://github.com/internetarchive/gowarc/blob/4a5d176aacd1246cb2a52f8974eecb3d7ee7b1e2/transport.go#L18
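For reference, this is how the idea looks with Go's standard net/http transport, where DisableCompression stops the transport from adding Accept-Encoding: gzip on its own. This is only a sketch against a local test server. The function name acceptEncodingSeen is made up, and whether gowarc's custom transport exposes an equivalent switch is not assumed here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// acceptEncodingSeen sends one GET through a transport configured with the
// given DisableCompression value and returns the Accept-Encoding header that
// the (local test) server received.
func acceptEncodingSeen(disableCompression bool) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Echo the request's Accept-Encoding header back as the body.
		io.WriteString(w, r.Header.Get("Accept-Encoding"))
	}))
	defer srv.Close()

	client := &http.Client{
		Transport: &http.Transport{DisableCompression: disableCompression},
	}
	resp, err := client.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	withGzip, _ := acceptEncodingSeen(false)
	withoutGzip, _ := acceptEncodingSeen(true)
	fmt.Printf("default: %q\n", withGzip)     // default: "gzip"
	fmt.Printf("disabled: %q\n", withoutGzip) // disabled: ""
}
```

A server that only compresses when asked should then return the body unencoded, which would sidestep the broken gzip responses if the corruption happens in the compression path.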