browsertrix-crawler

Config payload Digest sha1 base32

Open gitreich opened this issue 10 months ago • 2 comments

I couldn't find a setting to configure the payload digest as sha1 base32. Our entire archive (even ARC!) contains digests in sha1 base32, but the crawler currently writes sha256 hex. This causes problems with deduplication: our CDX uses base32 sha1, so with sha256 hex digests we cannot use the CDX for deduplication without regenerating the entire index.

For us the easiest solution would be to make it configurable via a parameter (--digest-encoding string; possibilities: base16, base32, base64, with one of them as the default; for us base32 would be great as the default). See also https://datatracker.ietf.org/doc/html/rfc3548

Version 0.12.4 used sha1 base32; version 1.0.2 now uses sha256 base16.
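To make the difference between the two formats concrete, here is a minimal Python sketch of how the same payload could be digested under each of the proposed encodings. The function name `payload_digest`, its parameters, and the `algorithm:value` output shape are assumptions for illustration only; this is not how browsertrix-crawler computes digests internally.

```python
import base64
import hashlib

# Hypothetical mapping from the proposed --digest-encoding values to encoders.
_ENCODERS = {
    "base16": lambda raw: raw.hex(),
    "base32": lambda raw: base64.b32encode(raw).decode("ascii"),
    "base64": lambda raw: base64.b64encode(raw).decode("ascii"),
}


def payload_digest(payload: bytes, algorithm: str = "sha1", encoding: str = "base32") -> str:
    """Return '<algorithm>:<encoded digest>' for the given payload bytes."""
    if encoding not in _ENCODERS:
        raise ValueError(f"unsupported encoding: {encoding}")
    raw = hashlib.new(algorithm, payload).digest()
    return f"{algorithm}:{_ENCODERS[encoding](raw)}"


if __name__ == "__main__":
    body = b"example payload"
    print(payload_digest(body))                      # sha1 base32
    print(payload_digest(body, "sha256", "base16"))  # sha256 hex
```

The two outputs differ in both algorithm and encoding, so a CDX lookup keyed on one form will never match records written in the other.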

gitreich, Apr 09 '24 11:04

Hi @gitreich - putting this on our sprint board to look into after IIPC WAC :)

tw4l, Apr 22 '24 09:04

Hi, at WAC24 @ikreymer brought up the idea of adding a parameter for the location of a CDX index (for dedup via writing revisit entries). If that feature arrives, this issue could also be handled through the CDX parameter: read the payload digest format from the given CDX and keep writing that same digest format into the newly generated WARCs.
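As a rough illustration of the "read the digest format from the CDX" idea, the following Python sketch guesses the algorithm and encoding from a single digest value by its length and alphabet. The function `detect_digest_format` is hypothetical, covers only sha1 and sha256 in base16 or base32, and is a heuristic rather than anything the crawler provides.

```python
import re

_HEX = re.compile(r"[0-9a-fA-F]+")
_B32 = re.compile(r"[A-Z2-7]+")


def detect_digest_format(value: str):
    """Guess (algorithm, encoding) from a digest string, or return None.

    Accepts values with or without an 'algo:' prefix; only the part
    after the last ':' is inspected.
    """
    _, _, body = value.strip().rpartition(":")
    if _HEX.fullmatch(body):
        algo = {40: "sha1", 64: "sha256"}.get(len(body))
        if algo:
            return algo, "base16"
    # Base32 padding is optional in practice, so compare unpadded lengths.
    stripped = body.rstrip("=")
    if _B32.fullmatch(stripped):
        algo = {32: "sha1", 52: "sha256"}.get(len(stripped))
        if algo:
            return algo, "base32"
    return None


if __name__ == "__main__":
    print(detect_digest_format("3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"))
```

A crawler following this approach would presumably sample the digest field of the first few CDX entries and refuse to continue if they disagree.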

gitreich, May 06 '24 10:05