Deduplication & "Not Modified" WARC Records
When crawling using Heritrix, if both sendIfModifiedSince and writeRevisitForNotModified are set to true (although the latter has been deprecated, presumably equivalent to always being true), a server may respond with an empty response, and a WARC record like the following can be written (taken from the warc-specification project):
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.bl.uk/
WARC-Date: 2014-11-24T08:13:54Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 91.194.151.38
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/server-not-modified
WARC-Truncated: length
WARC-Etag: "4078134-aed6-6117a140"
WARC-Record-ID: <urn:uuid:d41c9044-fad4-402a-bdc8-ff6c63d0f419>
Content-Length: 0
Here the WARC-Payload-Digest has been calculated on the empty, zero-length content. As a result, it won't match that of the earlier record and OpenWayback won't find the original payload.
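The digest in the record above can be checked by hand: it is the Base32-encoded SHA-1 of zero bytes. A minimal Python sketch, where the helper name payload_digest is made up for illustration and is not part of Heritrix or OpenWayback:

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a WARC-style digest label: 'sha1:' + Base32 of the SHA-1 hash.

    Hypothetical helper for illustration only.
    """
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


# Digest of an empty payload; the same for every such record.
EMPTY_SHA1_DIGEST = payload_digest(b"")
print(EMPTY_SHA1_DIGEST)
```

Because the value does not depend on the URL or the capture, every "not modified" revisit written this way would carry the same digest.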
The WARC spec. does say that:
For records using this profile, the payload is defined as the original payload content from which a 'LastModified' and/or 'ETag' value was taken.
Whether this means that the WARC-Payload-Digest should be calculated on the revisited record, I'm not sure. However, the above is a live, written WARC so we should probably figure out how to handle such things.
This is a Heritrix bug.
From the WARC spec, chapter 5.9 on WARC-Payload-Digest:
An optional parameter indicating the algorithm name and calculated value of a digest applied to the payload referred to or contained by the record - which is not necessarily equivalent to the record block
(emphasis is mine)
This clearly means that in server-not-modified revisit records, this field should either be omitted or be equal to the digest of the original record.
I do wonder if OpenWayback can gracefully handle the absence of the digest? Presumably it should if the original URL and date are provided?
If they are then yes, I think that should work. Worth building in a test case anyway.
In the case of the above I'm wondering whether we should attempt to handle this despite the fact it's non-compliant? Alternatively, we could give an example of a way to work around it outside OpenWayback (I'm thinking of a script to create 'dummy' CDX lines for revisits with no matching response).
We could probably detect digests of an empty payload (for a given algorithm they always have the same value) and process the record as if there weren't any digest.
In the absence of original URI and/or date, that would mean using the latest "previous" capture that isn't a revisit.
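The lookup order discussed above could be sketched roughly as follows. This is not OpenWayback code: the record layout (plain dicts with type, uri, date, digest, refers_to_uri and refers_to_date keys) and the function name find_original are assumptions made for the example, and dates are assumed to be UTC timestamps in one fixed format so that string comparison orders them correctly.

```python
from typing import List, Optional

# Base32 SHA-1 of zero bytes, as seen in the revisit record above.
EMPTY_SHA1_DIGEST = "sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"


def find_original(revisit: dict, captures: List[dict]) -> Optional[dict]:
    """Pick the capture whose payload a revisit should replay (hypothetical)."""
    responses = [c for c in captures if c["type"] != "revisit"]

    digest = revisit.get("digest")
    if digest == EMPTY_SHA1_DIGEST:
        digest = None  # computed on an empty body: treat as absent

    # 1. Usual path: match on payload digest.
    if digest is not None:
        for c in responses:
            if c.get("digest") == digest:
                return c

    # 2. Explicit pointer to the original URI and date, if recorded.
    ref_uri = revisit.get("refers_to_uri")
    ref_date = revisit.get("refers_to_date")
    if ref_uri and ref_date:
        for c in responses:
            if c["uri"] == ref_uri and c["date"] == ref_date:
                return c

    # 3. No usable digest: latest earlier non-revisit capture of the same URI.
    if digest is None:
        earlier = [
            c for c in responses
            if c["uri"] == revisit["uri"] and c["date"] < revisit["date"]
        ]
        return max(earlier, key=lambda c: c["date"], default=None)

    return None


captures = [
    {"type": "response", "uri": "http://www.bl.uk/",
     "date": "2014-11-01T08:00:00Z", "digest": "sha1:EXAMPLEDIGEST"},
    {"type": "response", "uri": "http://www.bl.uk/",
     "date": "2014-11-10T08:00:00Z", "digest": "sha1:EXAMPLEDIGEST"},
]
revisit = {"type": "revisit", "uri": "http://www.bl.uk/",
           "date": "2014-11-24T08:13:54Z", "digest": EMPTY_SHA1_DIGEST}
original = find_original(revisit, captures)
print(original["date"] if original else "no original found")
```

The last step is only taken when there is no usable digest, so a genuine digest that matches nothing is still reported as unresolved rather than silently mapped to a different payload.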
This relates to #117