js-multiformats
Identity encoding decoding doesn't produce the same data
I'm not sure if identity encoding is meant to be used like this, but I noticed that after decoding, you don't get the same data:
import { bases } from 'multiformats/basics';
const codec = bases['identity'];
const u = new Uint8Array([
  6, 22, 184, 240, 237, 178,
  112, 0, 150, 137, 182, 54,
  220, 1, 217, 221
]);
const s = codec.encode(u);
const u_ = codec.decode(s);
console.log(u_);
/*
Uint8Array(36) [
  6, 22, 239, 191, 189, 239, 191, 189,
  239, 191, 189, 239, 191, 189, 112, 0,
  239, 191, 189, 239, 191, 189, 239, 191,
  189, 54, 239, 191, 189, 1, 239, 191,
  189, 239, 191, 189
]
*/
You're hitting a limitation of JavaScript's UTF-8 handling: some byte sequences are not preserved across a bytes->string->bytes round trip. The built-in assumption is that bytes being converted to a string are valid UTF-8. Some languages differ here, such as Go, which can do []byte(string([]byte(...))) without loss (i.e. its strings can hold non-UTF-8 bytes).
To illustrate, take your 3rd byte, 184, which is not valid UTF-8 on its own (note how the first 2 bytes survive the round trip):
> new TextDecoder().decode(new Uint8Array([184]))
'�'
> new TextDecoder().decode(new Uint8Array([184])).charCodeAt(0)
65533
> new TextEncoder().encode(new TextDecoder().decode(new Uint8Array([184])))
Uint8Array(3) [ 239, 191, 189 ]
So you can see that invalid UTF-8 bytes get converted to U+FFFD (the replacement character), i.e. 65533, which encodes back to the sequence of 3 bytes you see repeated in your resulting array: 239, 191, 189. Every time you see these, you can assume that it's a non-UTF-8 byte that got lost in translation.
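To check up front whether a given byte array would survive that round trip, one option is a strict decoder. This is only a sketch: `isValidUtf8` and `utf8RoundTrip` are hypothetical helper names, not part of multiformats, and it relies on the standard `TextDecoder` `fatal` option, which throws on malformed input instead of substituting U+FFFD.

```javascript
// Hypothetical helper: true when `bytes` is valid UTF-8, i.e. when a
// bytes -> string -> bytes round trip would preserve it.
function isValidUtf8(bytes) {
  try {
    // With `fatal: true` the decoder throws a TypeError on malformed
    // input instead of silently inserting U+FFFD.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// Hypothetical helper: the default (non-fatal) round trip, as in the
// REPL session above.
function utf8RoundTrip(bytes) {
  return new TextEncoder().encode(new TextDecoder().decode(bytes));
}

// The 16 bytes from the original report.
const original = new Uint8Array([
  6, 22, 184, 240, 237, 178,
  112, 0, 150, 137, 182, 54,
  220, 1, 217, 221
]);

console.log(isValidUtf8(original)); // false
console.log(utf8RoundTrip(original).length); // 36, matching the report
```

Ten of the sixteen input bytes are malformed, and each is replaced by the 3-byte encoding of U+FFFD, which accounts for the 36-byte result.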
The identity multibase doesn't have much choice here. It is only safe to use with bytes that JavaScript can convert to a string and back without loss, i.e. valid UTF-8. For arbitrary bytes, use a multibase that maps them onto a safe set of characters, which avoids this problem (and is one of the points of using a base encoding!).
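As an illustration of why a character-mapping base avoids the problem, here is a minimal hand-rolled base16 sketch. `hexEncode` and `hexDecode` are hypothetical names written for this example; they are not the multiformats implementation and they omit the multibase prefix character. Every byte maps to two ASCII characters, so there is nothing for the string layer to replace.

```javascript
// Hypothetical helper: encode each byte as two lowercase hex digits.
function hexEncode(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

// Hypothetical helper: inverse of hexEncode; rejects malformed input.
function hexDecode(str) {
  if (str.length % 2 !== 0 || /[^0-9a-f]/i.test(str)) {
    throw new Error('invalid hex string');
  }
  const out = new Uint8Array(str.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(str.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

// First 8 bytes of the array from the original report.
const sample = new Uint8Array([6, 22, 184, 240, 237, 178, 112, 0]);
const encoded = hexEncode(sample);
console.log(encoded); // '0616b8f0edb27000'
console.log(hexDecode(encoded)); // same 8 bytes as `sample`
```

The trade-off is size: base16 doubles the length, while denser bases such as base64 cost roughly a third more than the raw bytes.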
I hope that helps explain the situation, even if it probably doesn't give you an easy solution.
I used codePointAt to convert to JS binary strings and back. https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary
Maybe that can be used instead?
Hm, that might not be a bad idea since codepoint addressing is now standard across runtimes.
Yeah, I used it for the above example and compared it to multibase to see if there were any differences.
https://github.com/MatrixAI/js-id/blob/4ea34f2b50e8f259576fc2f8bb9f80d9a167e1a1/src/utils.ts#L75-L85
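function toString(id: Uint8Array): string {
  // Each byte becomes one UTF-16 code unit in the range 0-255,
  // so no byte is replaced or dropped.
  return String.fromCharCode(...id);
}
function fromString(idString: string): Id | undefined {
  // Assumes idString holds exactly 16 code units, each in the range 0-255.
  const id = IdInternal.create(16);
  for (let i = 0; i < 16; i++) {
    id[i] = idString.charCodeAt(i);
  }
  return id;
}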
And it worked whereas multibase failed.
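For reference, the same binary-string approach can be generalized beyond 16-byte ids. This is a sketch under assumptions, not a proposed patch to multiformats: `bytesToBinaryString` and `binaryStringToBytes` are hypothetical names, and the chunk size is an arbitrary choice to stay well below engine limits on the number of spread arguments.

```javascript
// Hypothetical helper: map each byte to one UTF-16 code unit (0-255).
function bytesToBinaryString(bytes) {
  const CHUNK = 0x8000; // arbitrary; avoids spreading huge arrays at once
  let out = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    out += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return out;
}

// Hypothetical helper: inverse mapping; rejects code units above 255
// rather than silently truncating them.
function binaryStringToBytes(str) {
  const out = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    if (code > 0xff) {
      throw new RangeError(`not a binary string: code unit ${code} at index ${i}`);
    }
    out[i] = code;
  }
  return out;
}

// The 16 bytes from the original report.
const id = new Uint8Array([
  6, 22, 184, 240, 237, 178,
  112, 0, 150, 137, 182, 54,
  220, 1, 217, 221
]);
const idString = bytesToBinaryString(id);
console.log(idString.length); // 16
console.log(binaryStringToBytes(idString)); // same 16 bytes as `id`
```

One caveat: the result is lossless only while it stays a JavaScript string. If it is later serialized as UTF-8 (written to a file, sent over the network), every code unit above 127 becomes two bytes, so the consumer has to decode the text and then apply the inverse mapping.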