warc-specifications

Align digest-value grammar with base16/32/64 alphabets

Open wumpus opened this issue 7 years ago • 6 comments

1.0 and 1.1 specify

labelled-digest = algorithm ":" digest-value

and digest-value is a token. "/" and "=" are not valid characters in a token, yet "/" is part of the standard base64 alphabet and "=" is the usual padding character.
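
To make the mismatch concrete, here is a minimal Python sketch. It assumes the WARC token rule excludes the RFC 2616-style separators; the `SEPARATORS` set and the `is_token` helper are illustrative names, not taken from the spec.

```python
import base64
import hashlib

# Characters excluded from `token`, assuming the RFC 2616-style separator list.
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')

def is_token(value: str) -> bool:
    """Return True if value is a non-empty run of printable ASCII non-separators."""
    return bool(value) and all(
        32 < ord(ch) < 127 and ch not in SEPARATORS for ch in value
    )

sha1 = hashlib.sha1(b"hello").digest()      # 20 bytes
sha256 = hashlib.sha256(b"hello").digest()  # 32 bytes

sha1_b16 = base64.b16encode(sha1).decode("ascii")
sha1_b32 = base64.b32encode(sha1).decode("ascii")
sha1_b64 = base64.b64encode(sha1).decode("ascii")
sha256_b32 = base64.b32encode(sha256).decode("ascii")

for label, value in [
    ("sha1 base16", sha1_b16),
    ("sha1 base32", sha1_b32),
    ("sha1 base64", sha1_b64),
    ("sha256 base32", sha256_b32),
]:
    print(label, value, is_token(value))
```

Under that assumption, base16 always passes, and a 20-byte SHA-1 digest in base32 happens to pass because it needs no padding. The base64 form of SHA-1 and the base32 form of SHA-256 both end in "=" and so fail the token check.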

wumpus, Nov 26 '18 23:11

Good catch. While the examples and most implementations use base32 (which doesn't include "/"), the padding character for base32 is also "=", so it's indeed a problem there too.

@wumpus, so that we can turn this issue into a change proposal for WARC 1.2, is there a better definition for digest-value you'd like to propose?

ato, Nov 27 '18 01:11

https://tools.ietf.org/html/rfc4648 is kind of hand-wavy, but the union of the alphabets of all of the recommended schemes, plus padding, is

A-Za-z0-9/+-_=

Percent-encoding is mentioned once, and "~" and "." are mentioned but argued against, so it's not clear whether they are allowed. It's as if the RFC was written to be non-normative.
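
As a rough sketch only, that union could be written as a regular expression. The pattern and the `is_candidate_digest_value` name below are illustrative assumptions, not an agreed grammar for digest-value; "~", "." and "%" are deliberately left out because their status is unclear.

```python
import base64
import re

# Union of the RFC 4648 alphabets plus padding, as listed above.
# "-" is escaped so it is a literal rather than a range operator.
DIGEST_VALUE = re.compile(r"[A-Za-z0-9/+\-_=]+")

def is_candidate_digest_value(value: str) -> bool:
    """Return True if value uses only characters from the union alphabet."""
    return DIGEST_VALUE.fullmatch(value) is not None

# Two bytes chosen so that both special characters of each alphabet appear.
standard = base64.b64encode(b"\xfb\xff").decode("ascii")
url_safe = base64.urlsafe_b64encode(b"\xfb\xff").decode("ascii")

print(standard, is_candidate_digest_value(standard))
print(url_safe, is_candidate_digest_value(url_safe))
```

A character-class check like this says nothing about whether the value is well-formed for a particular encoding (correct length, padding only at the end); it only shows which characters the grammar would have to admit.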

wumpus, Nov 27 '18 16:11

This is also a 1.0/1.1 erratum, not just a proposal for the future.

wumpus, Feb 03 '19 00:02

@ato, this issue should get the "WARC/1.1-possible-errata" label.

wumpus, Nov 05 '19 07:11

Ah yes, good point

ato, Nov 05 '19 07:11

Given the problem noted in issue #80 of determining how the digest is encoded, shouldn't the specification be changed to something like labelled-digest = algorithm ":" encoding ":" digest-value, with suitable definitions for algorithm and encoding?
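
To illustrate what a reader of such a three-part field might look like, here is a hypothetical Python sketch. The encoding labels ("base16", "base32", "base64") and all names in the code are assumptions made for the example; nothing here is defined by any WARC version.

```python
import base64
from typing import NamedTuple

class LabelledDigest(NamedTuple):
    algorithm: str
    encoding: str
    raw: bytes

# Hypothetical encoding labels mapped to standard-library decoders.
DECODERS = {
    "base16": base64.b16decode,  # expects uppercase hex by default
    "base32": base64.b32decode,
    "base64": base64.b64decode,
}

def parse_labelled_digest(field: str) -> LabelledDigest:
    """Split algorithm ":" encoding ":" digest-value and decode the value."""
    # maxsplit=2 keeps any later ":" inside the value; too few parts
    # raises ValueError from the unpacking.
    algorithm, encoding, value = field.split(":", 2)
    decoder = DECODERS[encoding.lower()]  # KeyError for unknown labels
    return LabelledDigest(algorithm.lower(), encoding.lower(), decoder(value))

print(parse_labelled_digest("sha1:base64:+/8="))
```

With an explicit encoding label, a reader no longer has to guess the encoding from the length or alphabet of the value, which is the ambiguity the comment above refers to.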

ljdarj, Sep 21 '24 17:09