warcio
warcio recompress adds WARC-Block-Digest fields to records without one
It appears that warcio recompress will add WARC-Block-Digest fields to records that do not already have that field.
The attached ZIP (example-warcs.zip) contains 2 WARCs.
In orig.warc, the warcinfo record at the start does not have a WARC-Block-Digest field at all. However, if you run:
warcio recompress orig.warc warcio-recompress.warc.gz
gunzip warcio-recompress.warc.gz
and look at the resulting warcio-recompress.warc, you will see that the warcinfo record now has a WARC-Block-Digest field with a SHA1 hash. (I included a copy in the ZIP as warc-recompress.warc.)
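For anyone who wants to confirm this without comparing the files by eye, here is a minimal, stdlib-only sketch that lists the header fields present in the recompressed file but absent from the original, record by record. It does not use warcio itself. The helper names `iter_warc_headers` and `added_fields` are hypothetical, and the parser assumes uncompressed WARCs with a correct Content-Length on every record and the same record order in both files.

```python
import io


def iter_warc_headers(stream):
    """Yield a dict of header fields (names lowercased) for each record
    in an uncompressed WARC byte stream. Simplified: no continuation lines."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip over the record block without reading it.
        stream.seek(int(headers["content-length"]), io.SEEK_CUR)
        yield headers


def added_fields(orig_stream, new_stream):
    """Return (record index, WARC-Type, added field names) for every record
    whose headers gained fields. Assumes both streams hold the same records
    in the same order."""
    result = []
    pairs = zip(iter_warc_headers(orig_stream), iter_warc_headers(new_stream))
    for index, (old, new) in enumerate(pairs):
        extra = sorted(set(new) - set(old))
        if extra:
            result.append((index, old.get("warc-type"), extra))
    return result
```

To try it on the files from this report, open both in binary mode, for example `added_fields(open("orig.warc", "rb"), open("warcio-recompress.warc", "rb"))`, and print the returned list.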
While I suppose more digests aren't a bad thing:
- I would not expect a recompression operation to alter the records in the WARC.
- This behavior isn't documented.
- It (very slightly) increases the size of the WARC.
My suggestion would be that warcio recompress should not alter the records of the WARC it is operating on.