browsertrix-crawler
Config payload Digest sha1 base32
I couldn't find a setting to configure the payload digest as sha1 base32. Our entire archive (even the ARC files!) contains digests in sha1 base32, but the crawler currently writes sha256 hex. This causes problems for deduplication: our CDX uses base32 sha1, so we cannot use the CDX for deduplication without regenerating the entire index.
For us the easiest solution would be to make it configurable as a parameter (--digest-encoding string, possibilities: base16, base32, base64, with one of them as the default; for us base32 would be great as the default). See also https://datatracker.ietf.org/doc/html/rfc3548
Version 0.12.4 used sha1 base32; version 1.0.2 now uses sha256 base16.
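To illustrate the difference between the two formats, here is a minimal Python sketch that computes a payload digest in each of the encodings listed above. The function name `payload_digest` and its parameters are hypothetical and only mirror the proposed `--digest-encoding` option; this is not how browsertrix-crawler computes digests internally.

```python
import base64
import hashlib


def payload_digest(payload: bytes, algorithm: str = "sha1", encoding: str = "base32") -> str:
    """Return '<algorithm>:<encoded digest>' for the given payload.

    Hypothetical helper for illustration only; the encoding names follow
    RFC 3548 (base16, base32, base64).
    """
    raw = hashlib.new(algorithm, payload).digest()
    if encoding == "base16":
        encoded = raw.hex()  # lowercase hex
    elif encoding == "base32":
        encoded = base64.b32encode(raw).decode("ascii")
    elif encoding == "base64":
        encoded = base64.b64encode(raw).decode("ascii")
    else:
        raise ValueError(f"unsupported encoding: {encoding}")
    return f"{algorithm}:{encoded}"


if __name__ == "__main__":
    body = b""  # an empty payload
    # Format used by the older archive (sha1 base32)
    print(payload_digest(body, "sha1", "base32"))
    # Format described for the newer version (sha256 base16)
    print(payload_digest(body, "sha256", "base16"))
```

Because the two strings differ in both algorithm and encoding, a lookup of the new digest in an index built from the old one will never match, even for identical payloads.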
Hi @gitreich - putting this on our sprint board to look into after IIPC WAC :)
Hi, at WAC24 @ikreymer brought up the idea of adding a parameter for the location of a CDX index (for deduplication by writing revisit entries). If that feature arrives, this issue could also be handled through the same CDX parameter: read the payload digest format from the given CDX and keep writing digests in that format into the newly generated WARCs.
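A rough sketch of how the digest format could be read out of an existing index, assuming the CDX digest field holds either a bare digest or an `algorithm:digest` value. The function `guess_digest_format` is hypothetical, and the detection is only a heuristic based on the length and alphabet of the value.

```python
import re
from typing import Optional, Tuple

_HEX = re.compile(r"^[0-9a-fA-F]+$")
_BASE32 = re.compile(r"^[A-Z2-7]+$")

# (encoding, digest length without padding) -> algorithm
_KNOWN = {
    ("base16", 40): "sha1",
    ("base32", 32): "sha1",
    ("base16", 64): "sha256",
    ("base32", 52): "sha256",
}


def guess_digest_format(value: str) -> Optional[Tuple[str, str]]:
    """Guess (algorithm, encoding) from one CDX digest value.

    Heuristic only: returns None when the value matches no known shape.
    """
    # Drop an optional 'sha1:' style prefix and any base32 '=' padding.
    digest = value.rpartition(":")[2].rstrip("=")
    if _HEX.match(digest) and ("base16", len(digest)) in _KNOWN:
        return _KNOWN[("base16", len(digest))], "base16"
    if _BASE32.match(digest) and ("base32", len(digest)) in _KNOWN:
        return _KNOWN[("base32", len(digest))], "base32"
    return None


if __name__ == "__main__":
    print(guess_digest_format("3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ"))
    print(guess_digest_format(
        "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    ))
```

Sampling a few lines of the given CDX this way would be enough to pick the format for new records; an index with mixed formats would need an explicit override.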