Original Response headers (i.e., start with X-Archive-Orig-...) are modified

Open maturban opened this issue 7 years ago • 1 comments

Are you submitting a bug report or a feature request?

A bug report.

What is the current behavior?

Generate a WARC file for https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/ .

What is the expected behavior?

The Response headers returned for https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/ should be as follows:

Content-Encoding: gzip
X-App-Server: wwwb-app42
X-location: All
Transfer-Encoding: chunked
X-Archive-Playback: 0
X-Archive-Orig-vary: Accept-Encoding
Memento-Datetime: Wed, 05 Jul 2017 23:51:34 GMT
X-ts: ----
X-Archive-Orig-server: nginx
Server: Tengine/2.1.0
X-Archive-Guessed-Charset: utf-8
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Page-Cache: MISS
X-Archive-Orig-connection: close
X-Archive-Orig-date: Wed, 05 Jul 2017 23:51:39 GMT
X-Archive-Orig-content-length: 11603
Link: <http://www.cs.odu.edu/~maturban/>; rel="original", <https://web.archive.org/web/timemap/link/http://www.cs.odu.edu/~maturban/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://www.cs.odu.edu/~maturban/>; rel="timegate", <https://web.archive.org/web/20140917205517/http://www.cs.odu.edu/~maturban/>; rel="first memento"; datetime="Wed, 17 Sep 2014 20:55:17 GMT", <https://web.archive.org/web/20170614104612/http://www.cs.odu.edu/~maturban/>; rel="prev memento"; datetime="Wed, 14 Jun 2017 10:46:12 GMT", <https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/>; rel="memento"; datetime="Wed, 05 Jul 2017 23:51:34 GMT", <https://web.archive.org/web/20170710100858/http://www.cs.odu.edu/~maturban>; rel="next memento"; datetime="Mon, 10 Jul 2017 10:08:58 GMT", <https://web.archive.org/web/20170710100917/http://www.cs.odu.edu/~maturban/>; rel="last memento"; datetime="Mon, 10 Jul 2017 10:09:17 GMT"

But we got:

Date: Mon, 14 Aug 2017 03:41:43 GMT
X-App-Server: wwwb-app42
X-location: All
Transfer-Encoding: chunked
X-Archive-Playback: 0
X-Archive-Orig-vary: Accept-Encoding
Memento-Datetime: Wed, 05 Jul 2017 23:51:34 GMT
X-ts: ----
X-Archive-Orig-server: nginx
Server: Tengine/2.1.0
X-Archive-Guessed-Charset: utf-8
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Page-Cache: MISS
X-Archive-Orig-connection: close
X-Archive-Orig-date: Wed, 05 Jul 2017 23:51:39 GMT
X-Archive-Orig-Content-Length: 22495
Link: <http://www.cs.odu.edu/~maturban/>; rel="original", <https://web.archive.org/web/timemap/link/http://www.cs.odu.edu/~maturban/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://www.cs.odu.edu/~maturban/>; rel="timegate", <https://web.archive.org/web/20140917205517/http://www.cs.odu.edu/~maturban/>; rel="first memento"; datetime="Wed, 17 Sep 2014 20:55:17 GMT", <https://web.archive.org/web/20170614104612/http://www.cs.odu.edu/~maturban/>; rel="prev memento"; datetime="Wed, 14 Jun 2017 10:46:12 GMT", <https://web.archive.org/web/20170705235134/http://www.cs.odu.edu/~maturban/>; rel="memento"; datetime="Wed, 05 Jul 2017 23:51:34 GMT", <https://web.archive.org/web/20170710100858/http://www.cs.odu.edu/~maturban>; rel="next memento"; datetime="Mon, 10 Jul 2017 10:08:58 GMT", <https://web.archive.org/web/20170710100917/http://www.cs.odu.edu/~maturban/>; rel="last memento"; datetime="Mon, 10 Jul 2017 10:09:17 GMT"

The issue is that the value of one of the original Response headers, X-Archive-Orig-content-length, has been changed from 11603 to 22495. In general, I think none of the original Response headers (i.e., those starting with "X-Archive-Orig-") should be modified.

What's your environment?

macOS Sierra

Other information

I think the issue is from the following lines of code:

File: .../node-modules/node-warc/lib/writers/remoteChrome.js
Lines: 767 and 768
The code:

          responseHeaders = responseHeaders.replace(noGZ, '')
          responseHeaders = responseHeaders.replace(replaceContentLen, `Content-Length: ${Buffer.byteLength(resData, 'utf8')}${CRLF}`)
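
A minimal sketch of how a pattern like this could produce the result shown earlier. The definitions of `noGZ` and `replaceContentLen` are not quoted here, so `looseContentLen` and `anchoredContentLen` below are hypothetical patterns, and `rewriteContentLength` is an illustrative helper rather than node-warc code. If the pattern is unanchored and case-insensitive, it matches the tail of `X-Archive-Orig-content-length: 11603`, which would explain both the new value and the changed capitalization:

```javascript
const CRLF = '\r\n'

// Hypothetical pattern (assumption): unanchored and case-insensitive, so it
// also matches the "content-length: ..." tail of a prefixed header name.
const looseContentLen = /Content-Length:.*\r\n/gi

// Possible alternative (assumption): "^" with the "m" flag only matches at the
// start of a header line, so X-Archive-Orig-* headers are left alone.
const anchoredContentLen = /^Content-Length:.*\r\n/gim

// Illustrative helper: swap any matched header for a fresh Content-Length.
function rewriteContentLength (headers, pattern, byteLength) {
  return headers.replace(pattern, `Content-Length: ${byteLength}${CRLF}`)
}

// A few of the headers from this report, each terminated by CRLF.
const headers = [
  'HTTP/1.1 200 OK',
  'Transfer-Encoding: chunked',
  'X-Archive-Orig-content-length: 11603',
  'Content-Type: text/html; charset=utf-8',
  ''
].join(CRLF)

const loose = rewriteContentLength(headers, looseContentLen, 22495)
const anchored = rewriteContentLength(headers, anchoredContentLen, 22495)

// The loose pattern rewrites the archived header; the anchored one does not.
console.log(loose.includes('X-Archive-Orig-Content-Length: 22495'))
console.log(anchored === headers)
```

Anchoring to the start of a line is only one way to avoid the collision. Parsing the header block into name/value pairs and comparing full header names would be another.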

maturban, Aug 14 '17 04:08

@maturban

Thank you for pointing that out. I really should go ahead and add re-gzip or re-deflate via zlib rather than tightening up the regex used.

This is one for the node-warc issue tracker.

N0taN3rd, Aug 14 '17 04:08
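
The re-gzip idea from the reply could look roughly like the sketch below, using Node's built-in `zlib` module. It assumes the writer only has the decoded body text available and recompresses it before writing, presumably so that the `Content-Encoding` header would not need to be stripped. `regzipBody` and the sample body are hypothetical names for illustration, not node-warc code.

```javascript
const zlib = require('zlib')

// Hypothetical helper (assumption, not part of node-warc): recompress a
// decoded response body so it can be stored alongside a gzip Content-Encoding.
function regzipBody (decodedBody) {
  return zlib.gzipSync(Buffer.from(decodedBody, 'utf8'))
}

// Stand-in for the decoded body of the archived page.
const body = '<html><body>archived page</body></html>'
const gzipped = regzipBody(body)

// Decompressing the result gives back the original text.
const roundTripped = zlib.gunzipSync(gzipped).toString('utf8')
console.log(roundTripped === body)
```

Note that recompressed bytes are not guaranteed to be identical to what the server originally sent, so the compressed length may still differ from any length the server reported.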