
Base 2, base 8, and base 10

Open Stebalien opened this issue 5 years ago • 10 comments

The odd-balls in the current multibase spec are:

  • Base 2
  • Base 8
  • Base 10

That is, these are generally considered less useful than the other bases. The current situation is:

  • Base 2 is useful for bitfields.
  • One of base 8 or base 10 may be useful when only digits (0-9) are allowed.
    • Base 10 has a spec.
    • Base 10 is more compact.
    • Base 8 may be simpler to decode/encode.
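
To put rough numbers on the compactness point, here is a minimal Python sketch. The function names are hypothetical and not taken from any multibase implementation. It assumes base 8 packs 3 bits per digit without padding characters, and that base 10 treats the input as a single big-endian integer.

```python
import math


def chars_per_byte(base: int) -> float:
    """Average number of output characters needed per input byte."""
    return 8 / math.log2(base)


def encoded_length_base8(n_bytes: int) -> int:
    """Octal digits for n_bytes at 3 bits per digit (assumed: no padding chars)."""
    return (8 * n_bytes + 2) // 3  # ceiling of (8 * n_bytes) / 3


def encoded_length_base10(n_bytes: int) -> int:
    """Upper bound on decimal digits for an n_bytes big-endian integer."""
    return len(str(2 ** (8 * n_bytes) - 1))


if __name__ == "__main__":
    for base in (2, 8, 10):
        print(base, round(chars_per_byte(base), 3))
    print(encoded_length_base8(32), encoded_length_base10(32))
```

Under these assumptions, a 32-byte digest needs 86 octal digits but at most 78 decimal digits, so the space saving of base 10 is real but modest.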

The question is: which of these should we keep, if any? This is relevant to https://github.com/multiformats/go-multibase/pull/26 as, if we keep base 8, we need to define and implement it.

Stebalien avatar Aug 01 '19 17:08 Stebalien

I'm in favour of reducing the burden on implementers. If it turns out that there's a base encoding that isn't part of the spec yet, we can add it later on. I'm for starting with a valuable small set of things and expand if needed (which might never be needed).

vmx avatar Aug 02 '19 10:08 vmx

@vmx, as a C# implementor, I unilaterally decided not to implement these bases. See https://github.com/richardschneider/net-ipfs-core/issues/54

richardschneider avatar Aug 05 '19 06:08 richardschneider

Base8 can be encoded/decoded more efficiently (computationally efficient for large data). Base10 uses less space but is more expensive to encode/decode (space efficient).
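
A rough Python sketch of why the costs differ. All names here are hypothetical, and the base8 helper only handles input whose length is a multiple of 3 bytes, to keep the example short. The point is that octal digits can be produced one fixed-size chunk at a time, while decimal digits depend on the whole input as one big integer.

```python
def base8_chunk(chunk: bytes) -> str:
    """Encode exactly 3 bytes (24 bits) as 8 octal digits."""
    if len(chunk) != 3:
        raise ValueError("expected exactly 3 bytes")
    return format(int.from_bytes(chunk, "big"), "08o")


def base8_aligned(data: bytes) -> str:
    """Encode 3-byte-aligned input chunk by chunk (linear time, streamable)."""
    if len(data) % 3:
        raise ValueError("this sketch only handles 3-byte-aligned input")
    return "".join(base8_chunk(data[i:i + 3]) for i in range(0, len(data), 3))


def base10_whole(data: bytes) -> str:
    """Decimal conversion needs the whole input as one big integer.

    Leading zero bytes are ignored here; a real codec has to handle them.
    """
    return str(int.from_bytes(data, "big"))
```

With this sketch, `base8_aligned(a + b)` equals `base8_aligned(a) + base8_aligned(b)` for aligned chunks, whereas no such property holds for `base10_whole`, which is why decimal encoding of large inputs is more expensive.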

I would say both should be kept and a Base8 spec added.

fabianhjr avatar Oct 02 '19 03:10 fabianhjr

Note for those following along: while go-multibase never gained a base8 encoding implementation, js-multibase is about to get fully-baked support for this. Notably, from https://github.com/multiformats/js-multibase/pull/55#issue-427355352:

Note: base8 deviates from the spec tests outputs but aligns with multiformats/multibase#60

We should really make a decision here, and at least fix the shared-test-vectors to include only parts we expect implementations to support.

For easy-to-eyeball reference the current path taken by js-ipfs is: https://github.com/multiformats/js-multibase/blob/c8f762996e47403c0c41c4f16c35c7b252c4f31e/src/constants.js#L14-L39

refs:

  • Stalled Base-8 spec: https://github.com/multiformats/multibase/pull/60
  • Stalled go-multibase Base-8 implementation: https://github.com/multiformats/go-multibase/pull/26
  • C# decision: https://github.com/richardschneider/net-ipfs-core/issues/54
  • Rust: ???

/cc @vmx @rvagg @lidel @hugomrdias @creationix

ribasushi avatar Jun 03 '20 18:06 ribasushi

Are there actually use cases where only decimal digits but a large or arbitrary number of digits is allowed?

I know of lots of places that store integers, but those have size limits, typically 32 or 64 bits, which is way too small for hashes.

creationix avatar Jun 03 '20 18:06 creationix

Same question as @creationix -- I find these bases only interesting for academic purposes and would love to know what real-world use-cases there might be. Are we just doing completeness for completeness' sake?

Regarding the specific question, +1 on adopting what JS is doing now. The approach to base8 is consistent with the other bases so I think the change is correct and the test fixtures should change.

rvagg avatar Jun 04 '20 04:06 rvagg

I think I have some time to move #60 along. sorry for stalling Q_Q

fabianhjr avatar Jun 04 '20 21:06 fabianhjr

"I think I have some time to move #60 along. sorry for stalling Q_Q"

No worries @fabianhjr! May I suggest pivoting a bit and reframing the spec into a generic "rfc4648-derived" spec covering base8, base16, base32 and base64? This way you can abstractly define both padded and non-padded variants, and we can still get away with defining just the types we want implementations to support.

As a logical step 2, the base36 spec could be reworked into a "base-X spec" to define base10, base36 and base58.

This code-block puts in perspective what I mean by "let's just have 2 generic specs": https://github.com/multiformats/js-multibase/blob/c8f762996e47403c0c41c4f16c35c7b252c4f31e/src/constants.js#L14-L39
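
For readers who prefer not to follow the link, the "two generic specs" idea can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the js-multibase code: the function names, signatures and alphabet constants are hypothetical, and only encoding is shown.

```python
from typing import Optional

BASE8 = "01234567"
BASE10 = "0123456789"
BASE16 = "0123456789abcdef"
BASE64 = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          "abcdefghijklmnopqrstuvwxyz0123456789+/")


def rfc4648_encode(data: bytes, alphabet: str, bits_per_char: int,
                   pad: Optional[str] = None) -> str:
    """Bit-group encoder; len(alphabet) must equal 2 ** bits_per_char."""
    mask = (1 << bits_per_char) - 1
    buffer, nbits, out = 0, 0, []
    for byte in data:
        buffer = (buffer << 8) | byte
        nbits += 8
        while nbits >= bits_per_char:
            nbits -= bits_per_char
            out.append(alphabet[(buffer >> nbits) & mask])
        buffer &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        # Zero-pad the final incomplete group on the right.
        out.append(alphabet[(buffer << (bits_per_char - nbits)) & mask])
    if pad:
        # Pad until the output covers a whole number of bytes.
        while (len(out) * bits_per_char) % 8:
            out.append(pad)
    return "".join(out)


def basex_encode(data: bytes, alphabet: str) -> str:
    """Big-integer encoder; each leading zero byte becomes one alphabet[0]."""
    stripped = data.lstrip(b"\x00")
    zeros = len(data) - len(stripped)
    number = int.from_bytes(stripped, "big")
    base = len(alphabet)
    digits = []
    while number:
        number, rem = divmod(number, base)
        digits.append(alphabet[rem])
    return alphabet[0] * zeros + "".join(reversed(digits))
```

Under this split, base8, base16, base32 and base64 would differ only in the alphabet, the bits per character and the padding choice passed to `rfc4648_encode`, while base10, base36 and base58 would differ only in the alphabet passed to `basex_encode`.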

ribasushi avatar Jun 04 '20 21:06 ribasushi

@ribasushi, pushed some changes to leave the simple mapping and mention it as RFC4648 derived.

fabianhjr avatar Jun 04 '20 21:06 fabianhjr

Given that base2, base8, base10 and base16 are all common bases for number literals, it would be good to have common behaviour when decoding non-canonical strings. As far as I understand, things currently stand as follows:

  • base2, base10 explicitly preserve leading zeros and encode/decode the trailing data;
  • base8 drops the last incomplete word;
  • base16 is somewhat ambiguous, because rfc4648 Section 3.5 does not mandate a specific behaviour for decoders.

In my opinion, the expected behaviour for these encodings should be the same: preserve leading zeros, then encode/decode using the given base (and choice of alphabet for digits). This is effectively the same as zero-padding bits to the left, and is the same behaviour as base36 and base58.

However, base16 is described by rfc4648 Section 8 as being analogous to base32 and base64, and the latter both mandate zero-padding of bits to the right when necessary to complete a bit group in encoding. Furthermore, rfc4648 Section 3.5 mentions that encodings with non-zero bit padding MAY be rejected by decoders, from which one might deduce that the intended behaviour for decoders was also to consider zero-padding of bits to the right. However, this is not explicitly mandated, as far as I understand.

The above suggests three possible choices when decoding odd-length strings in base16:

  • reject them, as done by the base64 module of Python;
  • zero-pad them to the right, as rfc4648 might seem to indicate;
  • zero-pad them to the left, analogously to base2 and base10 (which should also be what base8 does, IMHO).
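
The three choices can be made concrete with a small Python sketch. The function name and the `odd` parameter are hypothetical and exist only to compare the behaviours; this is not an existing multibase API.

```python
def decode_base16(text: str, odd: str = "reject") -> bytes:
    """Decode a hex string, with a selectable policy for odd-length input.

    odd="reject":    raise an error
    odd="pad_right": append a zero digit (zero-pad bits to the right)
    odd="pad_left":  prepend a zero digit (zero-pad bits to the left)
    """
    if len(text) % 2:
        if odd == "reject":
            raise ValueError("odd-length base16 string")
        elif odd == "pad_right":
            text = text + "0"
        elif odd == "pad_left":
            text = "0" + text
        else:
            raise ValueError(f"unknown policy: {odd!r}")
    return bytes.fromhex(text)
```

For the input `"abc"`, right-padding yields the bytes `ab c0` while left-padding yields `0a bc`, so the choice changes the decoded value and not just the error handling.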

sg495 avatar Oct 15 '21 11:10 sg495