Squidwarc
Original Response headers (i.e., start with X-Archive-Orig-...) are modified
Are you submitting a bug report or a feature request?
A bug report.
What is the current behavior?
Generate a WARC file for https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/ .
What is the expected behavior?
The response headers returned for https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/ should be as follows:
Content-Encoding: gzip
X-App-Server: wwwb-app42
X-location: All
Transfer-Encoding: chunked
X-Archive-Playback: 0
X-Archive-Orig-vary: Accept-Encoding
Memento-Datetime: Wed, 05 Jul 2017 23:51:34 GMT
X-ts: ----
X-Archive-Orig-server: nginx
Server: Tengine/2.1.0
X-Archive-Guessed-Charset: utf-8
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Page-Cache: MISS
X-Archive-Orig-connection: close
X-Archive-Orig-date: Wed, 05 Jul 2017 23:51:39 GMT
X-Archive-Orig-content-length: 11603
Link: <http://www.cs.odu.edu/~maturban/>; rel="original", <https://web.archive.org/web/timemap/link/http://www.cs.odu.edu/~maturban/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://www.cs.odu.edu/~maturban/>; rel="timegate", <https://web.archive.org/web/20140917205517/http://www.cs.odu.edu/~maturban/>; rel="first memento"; datetime="Wed, 17 Sep 2014 20:55:17 GMT", <https://web.archive.org/web/20170614104612/http://www.cs.odu.edu/~maturban/>; rel="prev memento"; datetime="Wed, 14 Jun 2017 10:46:12 GMT", <https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/>; rel="memento"; datetime="Wed, 05 Jul 2017 23:51:34 GMT", <https://web.archive.org/web/20170710100858/http://www.cs.odu.edu/~maturban>; rel="next memento"; datetime="Mon, 10 Jul 2017 10:08:58 GMT", <https://web.archive.org/web/20170710100917/http://www.cs.odu.edu/~maturban/>; rel="last memento"; datetime="Mon, 10 Jul 2017 10:09:17 GMT"
But we got:
Date: Mon, 14 Aug 2017 03:41:43 GMT
X-App-Server: wwwb-app42
X-location: All
Transfer-Encoding: chunked
X-Archive-Playback: 0
X-Archive-Orig-vary: Accept-Encoding
Memento-Datetime: Wed, 05 Jul 2017 23:51:34 GMT
X-ts: ----
X-Archive-Orig-server: nginx
Server: Tengine/2.1.0
X-Archive-Guessed-Charset: utf-8
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Page-Cache: MISS
X-Archive-Orig-connection: close
X-Archive-Orig-date: Wed, 05 Jul 2017 23:51:39 GMT
X-Archive-Orig-Content-Length: 22495
Link: <http://www.cs.odu.edu/~maturban/>; rel="original", <https://web.archive.org/web/timemap/link/http://www.cs.odu.edu/~maturban/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://www.cs.odu.edu/~maturban/>; rel="timegate", <https://web.archive.org/web/20140917205517/http://www.cs.odu.edu/~maturban/>; rel="first memento"; datetime="Wed, 17 Sep 2014 20:55:17 GMT", <https://web.archive.org/web/20170614104612/http://www.cs.odu.edu/~maturban/>; rel="prev memento"; datetime="Wed, 14 Jun 2017 10:46:12 GMT", <https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/>; rel="memento"; datetime="Wed, 05 Jul 2017 23:51:34 GMT", <https://web.archive.org/web/20170710100858/http://www.cs.odu.edu/~maturban>; rel="next memento"; datetime="Mon, 10 Jul 2017 10:08:58 GMT", <https://web.archive.org/web/20170710100917/http://www.cs.odu.edu/~maturban/>; rel="last memento"; datetime="Mon, 10 Jul 2017 10:09:17 GMT"
The issue is that the value of one of the original response headers (X-Archive-Orig-content-length) has been changed from 11603 to 22495. In general, I think none of the original response headers (those whose names start with "X-Archive-Orig-") should be modified.
What's your environment?
macOS Sierra
Other information
I think the issue is from the following lines of code:
File: .../node_modules/node-warc/lib/writers/remoteChrome.js
Lines: 767 and 768
The code:
responseHeaders = responseHeaders.replace(noGZ, '')
responseHeaders = responseHeaders.replace(replaceContentLen, `Content-Length: ${Buffer.byteLength(resData, 'utf8')}${CRLF}`)
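A minimal sketch of how an unanchored pattern could produce the behavior reported above. The actual `noGZ` and `replaceContentLen` patterns in node-warc are not shown in this report, so `looseContentLen` and `anchoredContentLen` below are hypothetical stand-ins, and the header block is a shortened example. The point is only that a case-insensitive match on `Content-Length:` anywhere in a line also matches the tail of `X-Archive-Orig-content-length:`, while a pattern anchored to the start of a line does not.

```javascript
// Hypothetical patterns; the real regexes used by node-warc may differ.
const CRLF = '\r\n'
const looseContentLen = /Content-Length:.*\r\n/gi // matches anywhere in a line
const anchoredContentLen = /^Content-Length:.*\r\n/gim // matches only at the start of a line

// Shortened example header block, assuming CRLF-separated header lines.
const headers = [
  'HTTP/1.1 200 OK',
  'Content-Type: text/html; charset=utf-8',
  'X-Archive-Orig-content-length: 11603',
  'Content-Length: 11603'
].join(CRLF) + CRLF

const replacement = `Content-Length: 22495${CRLF}`

// The loose pattern rewrites both the real header and the archived original.
const looseResult = headers.replace(looseContentLen, replacement)

// The anchored pattern rewrites only the real Content-Length header.
const anchoredResult = headers.replace(anchoredContentLen, replacement)

console.log(looseResult)
console.log(anchoredResult)
```

With the loose pattern, the archived header comes out as `X-Archive-Orig-Content-Length: 22495`, which matches both the changed value and the changed capitalization in the output above.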
@maturban
Thank you for pointing that out. I really should go ahead and add re-gzip or re-deflate via zlib rather than just tightening up the regex used.
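The re-gzip idea could be sketched roughly as follows, using Node's built-in `zlib` module. This is an illustration under stated assumptions and not the node-warc implementation: `recompressBody` is a hypothetical helper, and the body is assumed to be available as a decoded UTF-8 string. Recompressing would let the `Content-Encoding: gzip` header stay in place, though the compressed bytes (and so their length) will generally not be identical to what the server originally sent.

```javascript
const zlib = require('zlib')

const CRLF = '\r\n'

// Hypothetical helper: gzip a decoded body and build a matching Content-Length line.
function recompressBody (bodyText) {
  const gzipped = zlib.gzipSync(Buffer.from(bodyText, 'utf8'))
  return {
    body: gzipped,
    // Length of the compressed payload, not of the decoded text.
    contentLengthHeader: `Content-Length: ${gzipped.length}${CRLF}`
  }
}

const sample = '<html><body>hello</body></html>'
const { body, contentLengthHeader } = recompressBody(sample)

// Decompressing the stored payload should give back the decoded text.
const roundTrip = zlib.gunzipSync(body).toString('utf8')

console.log(contentLengthHeader.trim(), roundTrip === sample)
```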
One for the node-warc issue