warc-specifications

Add guidelines to describe recording WARC-level provenance?

Open anjackson opened this issue 11 years ago • 3 comments

It seems that there is a desire to record provenance of WARC files, e.g. in the case of concatenation. See http://ws-dl.blogspot.co.uk/2014/09/2014-09-02-warcmerge-merging-multiple.html

That proposal uses warcinfo records, but it is not clear that this is the right approach: the concatenation event applies to the whole WARC, whereas a warcinfo record applies only to the records that follow it. Modifying the warcinfo records also seems to defeat the purpose somewhat, since it alters an existing provenance record and so changes its hash.
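To illustrate the hash concern, here is a minimal Python sketch. It assumes a digest computed over the record block as a Base32-encoded SHA-1, which is a common convention but not the only permitted one; the field values and the `block_digest` helper are hypothetical and not taken from the linked proposal.

```python
import base64
import hashlib


def block_digest(block: bytes) -> str:
    """Return a labelled digest of a record block (Base32 SHA-1, assumed convention)."""
    return "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")


# Hypothetical warcinfo block as written by the original crawl.
original_block = (
    b"software: example-crawler/1.0\r\n"
    b"format: WARC File Format 1.0\r\n"
)

# The same block after a merge tool appends a note about the concatenation.
modified_block = original_block + b"description: merged from two WARC files\r\n"

# Any edit to the block yields a different digest, so the record no longer
# matches whatever hash was recorded for it before the merge.
digests_differ = block_digest(original_block) != block_digest(modified_block)
```

Anything that previously referred to the original record by its digest would no longer verify against the modified one.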

Anyway, this is just a reminder to consider adding a guideline on this issue. It presumably also applies if some other operation is performed on the WARCs that changes them.

anjackson, Sep 17 '14 13:09

The WARC spec (as I understand it) assumes that the WARC in and of itself is not an artifact, but that it contains artifacts. Thus recording the provenance of the WARC file itself is contraindicated.

This should either be stated explicitly in the spec or a new warcprovenance record could be added to explicitly deal with this. Personally, I'm in favor of the former until presented with a compelling argument as to why it is necessary to record these WARC transformations.

kris-sigur, Sep 17 '14 13:09

Does the WARC file itself even have an ID? I guess I'm not sure how much it would take to make this unambiguous, particularly for sequences of concatenations.

I agree we should provide a concrete guideline that indicates that this kind of provenance information should be recorded out-of-band, at least for now. I just can't see a way of meaningfully including it, but am happy to consider proposals such as a warcprovenance record.
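As a rough sketch of what "out-of-band" could look like for concatenation, the Python below writes the merged file and a JSON sidecar listing the input and output checksums. The sidecar name (`.provenance.json`), the manifest fields and the function names are all assumptions made for illustration; nothing here comes from the WARC spec or the implementation guidelines.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large WARCs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def concatenate_with_provenance(inputs, output):
    """Concatenate WARC files byte-for-byte and describe the event in a sidecar."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)

    # Hypothetical manifest layout: the WARC contents are left untouched and
    # the provenance lives entirely outside the file.
    manifest = {
        "event": "concatenation",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "inputs": [
            {"name": Path(p).name, "sha256": sha256_of(p)} for p in inputs
        ],
        "output": {"name": Path(output).name, "sha256": sha256_of(output)},
    }
    sidecar = Path(str(output) + ".provenance.json")
    sidecar.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Calling `concatenate_with_provenance(["a.warc", "b.warc"], "merged.warc")` would leave `merged.warc.provenance.json` next to the merged file. Because the sidecar names files and checksums rather than a WARC-level ID, it sidesteps the question of whether the file itself has one, though chains of repeated concatenations would still need the sidecars to be kept together.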

anjackson, Sep 17 '14 13:09

We may want to consider chapter 5.4 of the WARC implementation guidelines (see issue #2).

kris-sigur, Sep 19 '14 11:09