js-multiformats

Identity encoding decoding doesn't produce the same data

Open CMCDragonkai opened this issue 3 years ago • 4 comments

I'm not sure if identity encoding is meant to be used like this, but I noticed that after decoding, you don't get the same data:

import { bases } from 'multiformats/basics';

const codec = bases['identity'];

const u = new Uint8Array([
    6, 22, 184, 240, 237, 178,
  112,  0, 150, 137, 182,  54,
  220,  1, 217, 221
]);

const s = codec.encode(u);

const u_ = codec.decode(s);

console.log(u_);

/*
Uint8Array(36) [
    6,  22, 239, 191, 189, 239, 191, 189,
  239, 191, 189, 239, 191, 189, 112,   0,
  239, 191, 189, 239, 191, 189, 239, 191,
  189,  54, 239, 191, 189,   1, 239, 191,
  189, 239, 191, 189
]
*/

CMCDragonkai avatar Oct 17 '21 02:10 CMCDragonkai

You're hitting a limitation of how JavaScript converts between bytes and strings. Some byte sequences won't survive a bytes->string->bytes round-trip, because decoding bytes to a string assumes the bytes are valid UTF-8. That's unlike some languages, such as Go, which can do []byte(string([]byte(...))) without loss (i.e. their strings can hold non-UTF-8 bytes).

To illustrate, take your third byte, 184, which isn't valid UTF-8 on its own (note how the first two bytes survive the round-trip):

> new TextDecoder().decode(new Uint8Array([184]))
'�'
> new TextDecoder().decode(new Uint8Array([184])).charCodeAt(0)
65533
> new TextEncoder().encode(new TextDecoder().decode(new Uint8Array([184])))
Uint8Array(3) [ 239, 191, 189 ]

So you can see that invalid UTF-8 bytes get converted to U+FFFD (65533, the replacement character), which encodes in UTF-8 as the sequence of 3 bytes you see repeated in your resulting array: 239, 191, 189. Every time you see these, you can assume that a non-UTF-8 byte got lost in translation.
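
If it helps, here is a minimal sketch of how you could detect this up front rather than discover it after the fact. It relies only on the standard TextDecoder/TextEncoder globals; survivesUtf8RoundTrip is a hypothetical helper name, not something multiformats provides.

```javascript
// Hypothetical helper (not part of multiformats): returns true when `bytes`
// is valid UTF-8 and therefore survives bytes -> string -> bytes unchanged.
function survivesUtf8RoundTrip(bytes) {
  try {
    // fatal: true makes decode() throw a TypeError on malformed input
    // instead of substituting U+FFFD. ignoreBOM: true keeps a leading
    // byte-order mark in the output instead of silently dropping it.
    new TextDecoder('utf-8', { fatal: true, ignoreBOM: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// The 16 bytes from the original report.
const original = new Uint8Array([
  6, 22, 184, 240, 237, 178, 112, 0, 150, 137, 182, 54, 220, 1, 217, 221
]);

// The default (non-fatal) decoder replaces each bad byte with U+FFFD,
// which re-encodes as the 3 bytes 239, 191, 189.
const lossy = new TextEncoder().encode(new TextDecoder().decode(original));

console.log(survivesUtf8RoundTrip(original)); // false
console.log(lossy.length); // 36
console.log(survivesUtf8RoundTrip(new Uint8Array([6, 22]))); // true
```

Ten of the 16 input bytes are invalid here, so 10 x 3 replacement bytes plus the 6 surviving bytes gives the 36 in your output.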

The identity multibase doesn't have much choice here. Either use it only with bytes that JavaScript can safely convert to strings, or use a multibase that maps bytes onto a safe set of characters to avoid this problem (which is one of the points of using base encoding!).
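
To show why a base encoding sidesteps the problem, here is a rough, hand-rolled hex sketch. It is not the multiformats implementation (real multibase strings also carry a prefix character identifying the base), and toHex/fromHex are hypothetical names used only for illustration. Every byte maps to two ASCII characters, so no byte value can be invalid.

```javascript
// Hypothetical helpers for illustration only; not the multiformats API.

// Encode each byte as exactly two lowercase hex digits.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

// Decode a hex string back into bytes, rejecting malformed input.
function fromHex(hex) {
  if (hex.length % 2 !== 0 || !/^[0-9a-f]*$/i.test(hex)) {
    throw new Error('not a valid hex string');
  }
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = Number.parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

const input = new Uint8Array([
  6, 22, 184, 240, 237, 178, 112, 0, 150, 137, 182, 54, 220, 1, 217, 221
]);
const encoded = toHex(input);
const decoded = fromHex(encoded);

console.log(encoded.slice(0, 6)); // '0616b8'
console.log(decoded.length); // 16
console.log(decoded.every((b, i) => b === input[i])); // true
```

The trade-off is size: hex doubles the length, which is why denser alphabets such as base32, base58 or base64 are usually preferred.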

I hope that helps explain the situation, even if it probably doesn't give you an easy solution.

rvagg avatar Oct 18 '21 02:10 rvagg

I used codePointAt to convert to JS binary strings and back. https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary

Maybe that can be used instead?

CMCDragonkai avatar Oct 18 '21 04:10 CMCDragonkai

Hm, that might not be a bad idea since codepoint addressing is now standard across runtimes.

rvagg avatar Oct 18 '21 06:10 rvagg

Yeah, I used it for the above example and compared it to multibase to see if there were any differences.

https://github.com/MatrixAI/js-id/blob/4ea34f2b50e8f259576fc2f8bb9f80d9a167e1a1/src/utils.ts#L75-L85

function toString(id: Uint8Array): string {
  return String.fromCharCode(...id);
}

function fromString(idString: string): Id | undefined {
  const id = IdInternal.create(16);
  for (let i = 0; i < 16; i++) {
    id[i] = idString.charCodeAt(i);
  }
  return id;
}

And it worked whereas multibase failed.
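
For reference, the same idea can be sketched for arbitrary-length input in plain JavaScript. The function names below are hypothetical and this is not what multiformats does today; it only assumes that each byte maps to one UTF-16 code unit in the range 0-255. The chunking is there because spreading a very large array into String.fromCharCode can exceed the engine's argument limit.

```javascript
// Hypothetical helpers; a sketch of the "binary string" approach.

// Map each byte to the code unit with the same value (0-255).
function bytesToBinaryString(bytes) {
  const CHUNK = 0x8000; // keep each spread call well under argument limits
  let s = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    s += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return s;
}

// Reverse the mapping, refusing strings that are not binary strings.
function binaryStringToBytes(s) {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const code = s.charCodeAt(i);
    if (code > 0xff) {
      throw new RangeError('character at index ' + i + ' is outside 0-255');
    }
    out[i] = code;
  }
  return out;
}

const id = new Uint8Array([
  6, 22, 184, 240, 237, 178, 112, 0, 150, 137, 182, 54, 220, 1, 217, 221
]);
const asString = bytesToBinaryString(id);
const back = binaryStringToBytes(asString);

console.log(asString.length); // 16
console.log(back.every((b, i) => b === id[i])); // true
```

One caveat: such a string round-trips only while it stays a JavaScript string. Writing it out as UTF-8 (to a file or over the network) turns every code unit above 127 into two bytes, so the result no longer matches the original bytes.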

CMCDragonkai avatar Oct 18 '21 06:10 CMCDragonkai