
Deduplication & "Not Modified" WARC Records

Open PsypherPunk opened this issue 9 years ago • 5 comments

When crawling with Heritrix, if both sendIfModifiedSince and writeRevisitForNotModified are set to true (the latter has been deprecated, presumably because it is now treated as always true), a server may return an empty response, and a WARC record like the following can be written (taken from the warc-specification project):

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.bl.uk/
WARC-Date: 2014-11-24T08:13:54Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 91.194.151.38
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/server-not-modified
WARC-Truncated: length
WARC-Etag: "4078134-aed6-6117a140"
WARC-Record-ID: <urn:uuid:d41c9044-fad4-402a-bdc8-ff6c63d0f419>
Content-Length: 0

Here the WARC-Payload-Digest has been calculated on the empty, zero-length content. As a result, it won't match that of the earlier record and OpenWayback won't find the original payload.
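
To illustrate the point, the digest in the record above can be reproduced by hashing zero bytes. This is a minimal sketch, assuming the digest is SHA-1 encoded as Base32 (which is what the `sha1:` prefix and the value format suggest); the function name `warc_sha1_digest` is made up for this example.

```python
import base64
import hashlib


def warc_sha1_digest(payload: bytes) -> str:
    """Return a digest in the 'sha1:<base32>' form seen in the record above."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


# Digest of a zero-length payload, as written for the empty response.
EMPTY_PAYLOAD_DIGEST = warc_sha1_digest(b"")
print(EMPTY_PAYLOAD_DIGEST)
```

Any non-empty original payload hashes to a different value, which is why a digest-based lookup for the earlier record fails.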

The WARC spec does say:

For records using this profile, the payload is defined as the original payload content from which a 'LastModified' and/or 'ETag' value was taken.

Whether this means that the WARC-Payload-Digest should be calculated on the original payload of the revisited record, I'm not sure. However, the above is from a live, written WARC, so we should probably figure out how to handle such records.

PsypherPunk, Mar 11 '15 12:03

This is a Heritrix bug.

From the WARC spec, chapter 5.9 on WARC-Payload-Digest:

An optional parameter indicating the algorithm name and calculated value of a digest applied to the payload referred to or contained by the record - which is not necessarily equivalent to the record block

(emphasis is mine)

This clearly means that in server-not-modified revisit records, this field should either be omitted or be equal to the payload digest of the original record.
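
The rule stated above can be written as a small check. This is only a sketch of that reading of the spec, not anything taken from Heritrix or OpenWayback; `digest_is_consistent` and its parameters are hypothetical names.

```python
from typing import Optional


def digest_is_consistent(revisit_digest: Optional[str], original_digest: str) -> bool:
    """Check a server-not-modified revisit against the rule above.

    The revisit's WARC-Payload-Digest is acceptable if it is omitted (None)
    or if it equals the payload digest of the original record.
    """
    return revisit_digest is None or revisit_digest == original_digest
```

Under this check, the record quoted at the top of the issue would be rejected unless the original payload was itself empty.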

kris-sigur, Mar 11 '15 12:03

I do wonder whether OpenWayback can gracefully handle the absence of the digest. Presumably it should, if the original URL and date are provided?

kris-sigur, Mar 11 '15 13:03

If they are then yes, I think that should work. Worth building in a test case anyway.

In the case of the above I'm wondering whether we should attempt to handle this despite the fact it's non-compliant? Alternatively, we could give an example of a way to work around it outside OpenWayback (I'm thinking of a script to create 'dummy' CDX lines for revisits with no matching response).

PsypherPunk, Mar 11 '15 14:03

We could probably detect digests computed on an empty payload (these should always have the same value for a given algorithm) and process such records as if there wasn't any digest.

In the absence of original URI and/or date, that would mean using the latest "previous" capture that isn't a revisit.
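
A rough sketch of that fallback, for discussion only. It is not OpenWayback code: the `Capture` class, its field names, and `resolve_revisit` are all hypothetical, and timestamps are assumed to be UTC strings in one fixed format (as in WARC-Date) so that string comparison orders them chronologically.

```python
from dataclasses import dataclass
from typing import List, Optional

# Base32 SHA-1 of zero bytes, i.e. the digest from the record in this issue.
EMPTY_SHA1 = "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"


@dataclass
class Capture:
    """Hypothetical stand-in for one index entry."""
    url: str
    timestamp: str                      # e.g. "2014-11-24T08:13:54Z"
    is_revisit: bool
    digest: Optional[str] = None
    refers_to_uri: Optional[str] = None
    refers_to_date: Optional[str] = None


def resolve_revisit(revisit: Capture, captures: List[Capture]) -> Optional[Capture]:
    """Find the capture whose payload should be served for a revisit."""
    originals = [c for c in captures if not c.is_revisit]

    # 1. An explicit original URI and date wins if both are present.
    if revisit.refers_to_uri and revisit.refers_to_date:
        for c in originals:
            if c.url == revisit.refers_to_uri and c.timestamp == revisit.refers_to_date:
                return c

    # Only captures of the same URL made before the revisit are candidates.
    earlier = [c for c in originals
               if c.url == revisit.url and c.timestamp < revisit.timestamp]

    # 2. Treat an empty-payload digest as if no digest had been recorded.
    digest = None if revisit.digest == EMPTY_SHA1 else revisit.digest
    if digest is not None:
        earlier = [c for c in earlier if c.digest == digest]

    # 3. Use the latest remaining "previous" capture that isn't a revisit.
    return max(earlier, key=lambda c: c.timestamp) if earlier else None
```

One caveat with this approach: a genuinely empty original payload has the same digest, so it would also be handled by the fallback rather than by digest matching.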

kris-sigur, Mar 11 '15 15:03

This relates to #117

kris-sigur, Mar 13 '15 13:03