
warcio recompress adds WARC-Block-Digest fields to records without one

Open acidus99 opened this issue 1 year ago • 0 comments

It appears that warcio recompress will add WARC-Block-Digest fields to records that do not already have that field.

The attached ZIP (example-warcs.zip) contains 2 WARCs.

In orig.warc, the warcinfo record at the start does not have a WARC-Block-Digest field at all. However, if you run:

warcio recompress orig.warc warcio-recompress.warc.gz
gunzip warcio-recompress.warc.gz

and look at warcio-recompress.warc, you will see that the warcinfo record now has a WARC-Block-Digest field with a SHA1 hash. (I included a copy of the recompressed file in the ZIP.)
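To check this without reading the files by eye, here is a minimal sketch that lists the records with no WARC-Block-Digest header. It assumes warcio's ArchiveIterator and the rec_headers.get_header() accessor; the helper names records_missing_block_digest and scan_warc are hypothetical, made up for this illustration, and are not part of warcio.

```python
def records_missing_block_digest(records):
    """Return (index, record type) for every record lacking WARC-Block-Digest.

    `records` is any iterable of warcio-style records, i.e. objects that
    expose `rec_type` and `rec_headers.get_header(name)`.
    """
    missing = []
    for index, record in enumerate(records):
        # get_header returns None when the field is absent from the record.
        if record.rec_headers.get_header("WARC-Block-Digest") is None:
            missing.append((index, record.rec_type))
    return missing


def scan_warc(path):
    """Hypothetical convenience wrapper: scan one WARC file on disk."""
    # Imported here so the helper above can be used without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        return records_missing_block_digest(ArchiveIterator(stream))
```

Running scan_warc("orig.warc") and scan_warc("warcio-recompress.warc") and comparing the two lists should show whether the warcinfo record gained a digest, assuming the files are as described above.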

While I suppose more digests aren't a bad thing:

  • I would not expect a recompression operation to alter the records in the WARC.
  • This behavior isn't documented.
  • It (very slightly) increases the size of the WARC.

My suggestion would be that warcio recompress should not alter the records of the WARC it is operating on.

acidus99, Jan 07 '24 23:01