multibase Update test files

This pull request adds test vectors for Base 45 and Proquint.

Jun 05 '24 12:06 ben221199

wow, proquint too, I've been tempted for a long time to hack on that in JS but never got around to it

I'll verify, it'll depend on inspiration levels and my ability to pull myself away from more pressing things, plz be patient

Jun 06 '24 02:06 rvagg

I did read the spec and wrote some JS thing like this:

var CONSONANTS = "bdfghjklmnprstvz".split("");
var VOWELS     = "aiou".split("");

function encode(str){
    if(str.length>0 && str.length%2!=0){
        throw new Exception("Should be divisible by 16 bits");
    }
    var shorts = str.length/2;
    var parts = [];
    for(var i=0;i<shorts;i++){
        var value = str[i*2].charCodeAt(0)*256 + str[i*2+1].charCodeAt(0);
        parts.push(CONSONANTS[value>>12 & 0xF]+VOWELS[value>>10 & 0x3]+CONSONANTS[value>>6 & 0xF]+VOWELS[value>>4 & 0x3]+CONSONANTS[value>>0 & 0xF]);
    }
    return parts.join('-');
}

After that, I patched it a little, because it wasn't perfect. The * 256 was changed to << 8, the + became |, and the input string was changed to be an UInt8Array. Then it should work and be able to encode. Note that the input must always have an even amount of bytes. In the spec I didn't see anything that allowed any length, because that would cause the middle consonant sometimes to be split in halve, possibly giving the possibility to use multiple letters where the first two bits of the letter index are the same as the last two bits of the bytes. I don't think they want that.

Jun 06 '24 07:06 ben221199

My hand-rolled proquint implementation at https://github.com/multiformats/js-multiformats/pull/292 gives the following vectors:

Decentralize everything!! = pro-hidoj-katoj-kunuh-lanod-kudon-lonoj-fadoj-linoj-lanun-lidom-kojov-kisod-fah
yes mani ! = pro-lojoj-lasob-kujod-kunon-fabod
hello world = pro-kodoj-kudos-kusob-litoz-lanos-kib
\x00yes mani ! = pro-badun-kijug-fadot-kajov-kohob-fah
\x00\x00yes mani ! = pro-babab-lojoj-lasob-kujod-kunon-fabod

Left some additional notes in https://github.com/multiformats/js-multiformats/pull/292, but other reference implementations that can do more than just 16 and 32 bit numbers would be nice to try against.

Jun 06 '24 13:06 rvagg

The Proquint exercise makes me think we should update the multibase RFC doc for it to account for uneven byte inputs; I had to make a choice about that and I think it's a reasonable one.

Jun 06 '24 13:06 rvagg

My outputs for those values (when odd, I added 0x00, else my implementation only prints p):

pro-hidoj-katoj-kunuh-lanod-kudon-lonoj-fadoj-linoj-lanun-lidom-kojov-kisod-fahab (Same, but with ab)
pro-lojoj-lasob-kujod-kunon-fabod (Same)
pro-kodoj-kudos-kusob-litoz-lanos-kibab (Same, but with ab)
pro-badun-kijug-fadot-kajov-kohob-fahab (Same, but with ab)
pro-babab-lojoj-lasob-kujod-kunon-fabod (Same)

Seems your system works.

In order to support 8-bit chunks instead of the 16-bit chunks now, I think we should contact the original inventor of the thing. If we just do it ourselves, we break spec and possibly compatibility with other systems. I don't like that.

Jun 06 '24 13:06 ben221199

The Proquint exercise makes me think we should update the multibase RFC doc for it to account for uneven byte inputs; I had to make a choice about that and I think it's a reasonable one.

Okay, another option that doesn't break proquint spec, but gives us the possibility to indicate an odd amount of bytes would be this:

The p is our prefix. Then it is followed by ro-. Then follows the real 16-bit chunk encoded proquint output.

How do we tell the system that the content has an odd number of bytes? Well, like this:

First we decide what byte we want for the padding. I think 0x00 is fine. That will cause the additional ab in the end. Then, instead of ro-, we output or- to indicate the last byte can be removed from the end.

Because the ro- thing seems very Multibase specific, I don't think this will cause problems with systems elsewhere.

Jun 06 '24 13:06 ben221199

So:

por-hidoj-katoj-kunuh-lanod-kudon-lonoj-fadoj-linoj-lanun-lidom-kojov-kisod-fahab (Odd)
pro-lojoj-lasob-kujod-kunon-fabod (Even)
por-kodoj-kudos-kusob-litoz-lanos-kibab (Odd)
por-badun-kijug-fadot-kajov-kohob-fahab (Odd)
pro-babab-lojoj-lasob-kujod-kunon-fabod (Even)

Jun 06 '24 13:06 ben221199

That's an interesting option, although I still prefer mine. cvc is still pronounceable and 3 vs 5 make it very clear that there's truncation. But let's go fishing a bit more! I'll check with some folks who have some connection with the source of proquint in the multibase table. Then maybe we just go ask dsw's opinion over at https://github.com/dsw/proquint. But first I did some digging and here's what I discovered:

Proquint was first introduced, contributed by an external contributor to go-namesys but it was put together by Jeromy and merged by Juan in 2014, right back in the very early days of the whole stack: https://github.com/ipfs/go-namesys/commit/59eb2350afba56e0461dda65fd0999e7e17d513e
The implementation used depends on there being an even number of bytes! https://github.com/Bren2010/proquint/blob/master/proquint.go - that thing is going to panic if it gets an odd number to encode, but it'll always decode to an even number. I suppose the intention was to just use it for ipv4 addresses.
It was finally removed from go-ipfs in 2021 with a big resolver refactor: https://github.com/ipfs/kubo/pull/8068 (go-namesys was bumped to v0.2.0 and removed it entirely).
One of the problems that lead to its removal was that it lead to confusing errors for users, afaik nobody ever used them but it ended up getting involved in decoding bad ipns or dnslink strings, Error: not a valid proquint string (what on earth is this??), that was pushed out in go-ipfs v0.9.0: https://github.com/ipfs/kubo/releases/tag/v0.9.0
With the removal from go-ipfs, it was then shifted to the multibase table, along with the original mini-spec in there: https://github.com/multiformats/multibase/pull/78 - the v0.9.0 release said "We have removed support for proquints as they were out of place and largely unused, however proquints are valid multibases so if there is renewed interest in them there is a way forward".

And that's where we are.

But I still see no evidence of needing to deal with the odd byte problem. But let's fish for opinions!

Jun 07 '24 05:06 rvagg

But I still see no evidence of needing to deal with the odd byte problem. But let's fish for opinions!

Well, it is the only Multibase encoding that doesn't support odd bytes yet. In my opinion, a good Multibase encoding should be able to encode any data. For Proquints that is now not the case. I think there are 3 options:

Remove Proquints from Multibase (I don't like this option)
Keep Proquints and make it odd byte compatible without breaking specs (e.g. pro and por or another solution)
Keep Proquints and make it odd byte compatible with breaking specs (I don't like this option either)

Tagging @dsw and @randomwalker here.

Jul 17 '24 08:07 ben221199

I disagree that it breaks the proquint spec by shortening the last section. Any block of 3 characters using the same rules is still pronounceable and I don't think you can read the spec as being particularly strict about this, any more than its use being only applicable to 32-bit integers which seems to be what the main implementations are being used for. So I'm still in favour of keeping pro as a stable prefix so as not to have two parsing branches at the beginning for a multiformats consumer, and just shortening the last block to 3 (cvc) characters.

Jul 17 '24 10:07 rvagg

I see: the problem is that what if the number of bits gets cut off in the middle of a proquint or in the middle of a proquint character?

The problem really is that you are implicitly encoding the length of the string in the length of the characters. Hex has the same problem: the length is expected to be a multiple of 4 bits.

The solution that CPUs pick is they just zero-extend or sign-extend out to the machine word size and then they encode the length otherwise. Is this somehow not an option?

Daniel

On Wed, Jul 17, 2024 at 3:40 AM Rod Vagg @.***> wrote:

I disagree that it breaks the proquint spec by shortening the last section. Any block of 3 characters using the same rules is still pronounceable and I don't think you can read the spec as being particularly strict about this, any more than its use being only applicable to 32-bit integers which seems to be what the main implementations are being used for. So I'm still in favour of keeping pro as a stable prefix so as not to have two parsing branches at the beginning for a multiformats consumer, and just shortening the last block to 3 (cvc) characters.

— Reply to this email directly, view it on GitHub https://github.com/multiformats/multibase/pull/125#issuecomment-2232993834, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAL6XACNWTWIDVR2VHY4RDZMZCYXAVCNFSM6AAAAABI2SERWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZSHE4TGOBTGQ . You are receiving this because you were mentioned.Message ID: @.***>

Jul 17 '24 22:07 dsw

Hi @dsw, I see your point with hexadecimal. Many encodings/bases in multibase are not able to properly encode and decode nibbles without padding.

However, in the computer world we have decided that 8 bits (octet) is the smallest piece of information we can send over the internet and it is also standardized in many other ways. Because hex encodes each nibble (4 bits) with an alphanumeric and two of those nibbles fit in a byte, there is no problem. For every data with length x, you know you need a hex string with length 2x.

The problem with Proquints is that we are now not talking about things smaller than a byte/octet, but a thing that is bigger: 16 bits. When encoding data with an even amount x of bytes, you know that the output has x/2 proquint words. The problem occurs when the input data has an odd amount of bytes. The formula x/2 will output a fraction, so we came up with two options:

@rvagg option:
- If the input has an even amount of bytes, do nothing.
- If the input has an odd amount of bytes, cut the (last) proquint in half. So, hex 00 00 would be babab, but hex 00 would be bab. The unused last two bits are set to zero. (In my opinion this would break the Proquint spec at https://arxiv.org/html/0901.4016)
@ben221199 option:
- If the input has an even amount of bytes, do nothing.
- If the input has an odd amount of bytes, add 0x00 to the end. The data has now an even amount of bytes, so it is possible to encode it as usual. However, to indicate we padded the data with one byte, we use a different prefix. Normally it is p + ro (pro), but when the data input has odd bytes, it will become p + or (por).

Jul 18 '24 08:07 ben221199

Hm. I think changing the prefix would be weird. I think we should keep the prefix but extend the encoding. Let me think about it.

On Thu, Jul 18, 2024 at 1:17 AM Ben @.***> wrote:

Hi @dsw https://github.com/dsw, I see your point with hexadecimal. Many encodings/bases in multibase are not able to properly encode and decode nibbles for example without padding.

However, in the computer world we have decided that 8 bits (octet) is the smallest piece of information we can send over the internet and it is also standardized in many other ways. Because hex encodes each nibble (4 bits) with an alphanumeric and two of those nibbles fit in a byte, there is no problem. For every data with length x, you know you need a hex string with length 2x.

The problem with Proquints is that we are now not talking about things smaller than a byte/octet, but a thing that is bigger: 16 bits. When encoding data with an even amount x of bytes, you know that the output has x/2 proquint words. The problem occurs when the input data has an odd amount of bytes. The formula x/2 will output a fraction, so we came up with two options:

@rvagg https://github.com/rvagg option:

If the input has an even amount of bytes, do nothing.

If the input has an odd amount of bytes, cut the (last) proquint in half. So, hex 00 00 would be babab, but hex 00 would be bab. The unused last two bits are set to zero. (In my opinion this would break the Proquint spec at https://arxiv.org/html/0901.4016)

@ben221199 https://github.com/ben221199 option:

If the input has an even amount of bytes, do nothing.

If the input has an odd amount of bytes, add 0x00 to the end. The data has now an even amount of bytes, so it is possible to encode it as usual. However, to indicate we padded the data with one byte, we use a different prefix. Normally it is p + ro (pro), but when the data input has odd bytes, it will become p + or (por).

— Reply to this email directly, view it on GitHub https://github.com/multiformats/multibase/pull/125#issuecomment-2235907987, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAL6XARME6CMBBLZSHSVR3ZM523HAVCNFSM6AAAAABI2SERWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZVHEYDOOJYG4 . You are receiving this because you were mentioned.Message ID: @.***>

Jul 18 '24 18:07 dsw

Encode the last 2 bits using another vowel, preceded by a spoken but unwritten "y". To be canonical, do this only in the case where the encoding would require a consonant at the end of a string but not enough bits remain to make one. Now we can encode any string of even bit length.

The point of proquints was to be easily pronounceable and understandable, and the vowels "a", "i", "o", "u" all satisfy those properties when next to each other. I made the encoding alternate consonant-vowel mostly just to prevent unpronounceable consonant clusters.

You could still end up with weird vowel clusters. However, (1) those are not that bad to say, and (2) and we could have a convention like Hungarian where you often put a "j" between them. Since "j" is used but "y" is unused we could use that instead. Actually writing the "y" would make the width irregular. However we could make the "y" pronounced without being written whenever there are two vowels in a row.

Note that doing this requires the following pronunciation rules to prevent ambiguity:

g = hard g as in "golf",
j = hard j as in "just",
y = soft y as in "yes".

So in your example: 0x00 = 0qba[y]a.

Daniel

On Thu, Jul 18, 2024 at 11:51 AM Daniel Wilkerson < @.***> wrote:

Hm. I think changing the prefix would be weird. I think we should keep the prefix but extend the encoding. Let me think about it.

On Thu, Jul 18, 2024 at 1:17 AM Ben @.***> wrote:

Hi @dsw https://github.com/dsw, I see your point with hexadecimal. Many encodings/bases in multibase are not able to properly encode and decode nibbles for example without padding.

However, in the computer world we have decided that 8 bits (octet) is the smallest piece of information we can send over the internet and it is also standardized in many other ways. Because hex encodes each nibble (4 bits) with an alphanumeric and two of those nibbles fit in a byte, there is no problem. For every data with length x, you know you need a hex string with length 2x.

The problem with Proquints is that we are now not talking about things smaller than a byte/octet, but a thing that is bigger: 16 bits. When encoding data with an even amount x of bytes, you know that the output has x/2 proquint words. The problem occurs when the input data has an odd amount of bytes. The formula x/2 will output a fraction, so we came up with two options:

@rvagg https://github.com/rvagg option:

If the input has an even amount of bytes, do nothing.

If the input has an odd amount of bytes, cut the (last) proquint in half. So, hex 00 00 would be babab, but hex 00 would be bab. The unused last two bits are set to zero. (In my opinion this would break the Proquint spec at https://arxiv.org/html/0901.4016)

@ben221199 https://github.com/ben221199 option:

If the input has an even amount of bytes, do nothing.

If the input has an odd amount of bytes, add 0x00 to the end. The data has now an even amount of bytes, so it is possible to encode it as usual. However, to indicate we padded the data with one byte, we use a different prefix. Normally it is p + ro (pro), but when the data input has odd bytes, it will become p + or (por).

— Reply to this email directly, view it on GitHub https://github.com/multiformats/multibase/pull/125#issuecomment-2235907987, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAL6XARME6CMBBLZSHSVR3ZM523HAVCNFSM6AAAAABI2SERWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZVHEYDOOJYG4 . You are receiving this because you were mentioned.Message ID: @.***>

Jul 18 '24 19:07 dsw

Seems possible to use letters that are not yet in use by the spec. However, doesn't that break current spec and existing implementations?

And how does 0qba[y]a work? Where does 0q (zero Q) come from? I know 00 00 is babab as seen below, so I would only expect the last 2 or 3 letters different. Did you just mean baya?

I think the advantage of my option is that we don't have to change the proquint spec (which is focussed on 16-bit chunks in its core) and that the padding indication is outside of this encoding/decoding.

Jul 18 '24 20:07 ben221199

Ah, I never said in the spec that I intended the unambiguous indicator of a proquint to be the prefix "0q", to parallel "0x" for hex.

The "y" is in square brackets because you do not write it, you just say it.

I think changing the prefix to indicate length is frankly a really bad idea: the format has not changed and that is what a prefix usually indicates.

So I am going to update the spec to clarify all of this. I appreciate your pointing out the problem to me.

On Thu, Jul 18, 2024 at 1:09 PM Ben @.***> wrote:

Seems possible to use letters that are not yet in use by the spec. However, doesn't that break current spec and existing implementations?

And how does 0qba[y]a work? Where does 0q (zero Q) come from? I know 00 00 is babab as seen below, so I would only expect the last 2 or 3 letters different. Did you just mean baya?

image.png (view on web)

I think the advantage of my option is that we don't have to change the proquint spec (which is focussed on 16-bit chunks in its core) and that the padding indication is outside of this encoding/decoding.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

Jul 18 '24 21:07 dsw

Ah, I never said in the spec that I intended the unambiguous indicator of a proquint to be the prefix "0q", to parallel "0x" for hex.

Ah, fair enough. Seems fine to me. (Multibase uses a different prefix of course.)

The "y" is in square brackets because you do not write it, you just say it.

So, I see baa, and I say baya? 🤔 CVV indeed makes 8 bits, where CVC makes 10 bits. It makes sense in some way, but many implementations have to be changed. (And I care about backward compatability.)

I think changing the prefix to indicate length is frankly a really bad idea: the format has not changed and that is what a prefix usually indicates.

We wouldn't indicate length, we would indicate padding. The Multibase prefix letter p still stays p. The letters ro don't have a meaning other then making the prefix sound like pro, as in proquint. My first idea was using the inverse of it, or, to make por. But, because the data is "padded", it would also be possible to use ad, which makes pad. 😎 At least, the conclusion is we only change the meaningless second and third letters in this case.

Jul 19 '24 07:07 ben221199

Now I'm thinking, it seems you want this 16-bit problem to be solved in Proquint itself, so that it will support 8-bit from now on too. But is it a problem if Proquint just happens to be 16-bit chunked and we have to solve the problem elsewhere? I tried to solve it using the Multibase prefix pro/por/pad, but that doesn't work with 0q. If you want to solve it for 0q too, then I understand your option to change the spec, but in any other case, I think it is too drastic a change.

Jul 19 '24 07:07 ben221199

I never said that Proquints is 16-bit based. I said Proquints are "Identifiers that are Readable, Spellable, and Pronounceable". You are trying to maintain a constraint that you made up and that makes no sense to maintain.

Lexing two vowels in a row is the least drastic change and is backwards compatible. There is no ambiguity in lexing the format, which makes it backwards compatible. If you just allow any letter in any position, the code actually gets simpler, rather than adding any special case for the new semantics.

Changing the prefix is a very odd choice and I would like you to please not make it. I have literally never seen someone change a prefix to indicate a length. It is frankly a bad factoring to associate a prefix with a string length; I have literally never seen anyone even consider doing that.

On Fri, Jul 19, 2024 at 12:59 AM Ben @.***> wrote:

Now I'm thinking, it seems you want this 16-bit problem to be solved in Proquint itself, so that it will support 8-bit from now on too. But is it a problem if Proquint just happens to be 16-bit chunked and we have to solve the problem elsewhere? I tried to solve it using the Multibase prefix pro/ por/pad, but that doesn't work with 0q. If you want to solve it for 0q too, then I understand your option to change the spec, but in any other case, I think it is too drastic a change.

— Reply to this email directly, view it on GitHub https://github.com/multiformats/multibase/pull/125#issuecomment-2238591988, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAL6XEX5IW7WPXDEOGOVY3ZNDBO3AVCNFSM6AAAAABI2SERWKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMZYGU4TCOJYHA . You are receiving this because you were mentioned.Message ID: @.***>

Jul 19 '24 11:07 dsw

So I am going to update the spec to clarify all of this. I appreciate your pointing out the problem to me.

@dsw did you ever get around to this? I'm unsure from looking at https://github.com/dsw/proquint/commits/master/, but suspect you might not?

Feb 28 '25 02:02 vorburger

I have not gotten to that.

Daniel

On Thu, Feb 27, 2025 at 6:18 PM Michael Vorburger @.***> wrote:

So I am going to update the spec to clarify all of this. I appreciate your pointing out the problem to me.

@dsw https://github.com/dsw did you ever get around to this? I'm unsure from looking at https://github.com/dsw/proquint/commits/master/, but suspect you might not?

— Reply to this email directly, view it on GitHub https://github.com/multiformats/multibase/pull/125#issuecomment-2689549619, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAL6XGRFEMAY3VE7AW4EDL2R7BQNAVCNFSM6AAAAABYBIC4ZWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOBZGU2DSNRRHE . You are receiving this because you were mentioned.Message ID: @.***> [image: vorburger]vorburger left a comment (multiformats/multibase#125) https://github.com/multiformats/multibase/pull/125#issuecomment-2689549619

So I am going to update the spec to clarify all of this. I appreciate your pointing out the problem to me.

@dsw https://github.com/dsw did you ever get around to this? I'm unsure from looking at https://github.com/dsw/proquint/commits/master/, but suspect you might not?

— Reply to this email directly, view it on GitHub https://github.com/multiformats/multibase/pull/125#issuecomment-2689549619, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAL6XGRFEMAY3VE7AW4EDL2R7BQNAVCNFSM6AAAAABYBIC4ZWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOBZGU2DSNRRHE . You are receiving this because you were mentioned.Message ID: @.***>

Feb 28 '25 04:02 dsw