<?> characters

Open kizerkizer opened this issue 6 years ago • 1 comments

Hello Kevin,

I read your blog post on this and am excited. However, I notice that base122 produces characters that are not displayed in my browser. Perhaps you (or anybody) could add a layer of post-processing that converts base122 output to "readable" utf-8?

By that I mean one could create a mapping from the 1- and 2-byte sequences of base122 to 1-, 2-, 3-, and 4-byte sequences of a more "readable" UTF-8 encoding. We could determine which 2-byte sequences are least readable and replace each of them with a "readable" 3-byte sequence.

kizerkizer, Apr 18 '18 01:04

Hi, thanks for taking an interest in base122! That's correct: some characters, such as ASCII control characters, are expected to be unrenderable by a browser.

Creating a post-processing layer would make debugging the output of base122 easier. However, that post-processing layer could be converting base122 to base64, right?
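A rough sketch of that idea in Node.js, assuming the decoded data is available as an array of byte values (the later snippets in this thread pass the result of base122.decode to Buffer.from, which suggests as much). The bytes below are made-up placeholders, not real base122 output:

```javascript
// Placeholder for the bytes that a base122 decode step would return (assumed shape).
const decodedBytes = [0x00, 0x1f, 0x7f, 0xff];

// Re-encode the same bytes as base64, which uses only printable ASCII characters.
const readable = Buffer.from(decodedBytes).toString("base64");
console.log("Readable form:", readable);

// Going back from base64 recovers exactly the original bytes.
const roundTripped = [...Buffer.from(readable, "base64")];
console.log("Round trip ok:", roundTripped.join() === decodedBytes.join());
```

The trade-off is that the readable form is base64-sized, so this only makes sense for debugging, not for storage or transfer.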

kevinAlbs, Apr 18 '18 21:04

This might be related: I tried this library with the minimal example below, and the encoding does not seem to be reversible.

Example Code

const data = { a: "美味しい", b: "مزیدار" };
let originalString = JSON.stringify(data);
console.log("Original:", originalString);

let encodedRaw = base122.encode(originalString);
let encodedString = Buffer.from(encodedRaw, "utf-8").toString();
console.log("Encoded:", encodedString);

let decodedRaw = base122.decode(encodedString);
let decodedString = Buffer.from(decodedRaw).toString("utf-8");
console.log("Decoded:", decodedString);

Output

Original: {"a":"美味しい","b":"مزیدار"}
Encoded: =HL↕◄hE9UhB◄0Db◄D$)K↑/‼L$'h
Decoded: {"a":"�sWD","b":"E2�/'1"}

I am pretty sure it's me using it wrong, but I can't spot the mistake for the life of me.

devqazi, Oct 11 '23 15:10

@devqazi Thank you for the report.

encode expects a String argument that meets the same condition as input to btoa:

strings in which the code point of each character occupies only one byte.

JavaScript strings use UTF-16, and some code points in your input string exceed what one byte can store:

Example:

// JavaScript represents a string as a sequence of UTF-16 code units.
let originalString = "美";
console.log(`originalString.codePointAt(0)=${originalString.codePointAt(0)}`);
// Prints: originalString.codePointAt(0)=32654
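To make the difference concrete, here is a small Node.js sketch that compares the UTF-16 code unit of the same character with its UTF-8 bytes. It uses only built-in String and Buffer calls; the variable names are just for illustration:

```javascript
const ch = "美";

// A single UTF-16 code unit: 0x7F8E, far above the one-byte limit of 0xFF.
const codeUnit = ch.charCodeAt(0);
console.log("Fits in one byte:", codeUnit <= 0xff); // false

// The same character as UTF-8: several bytes, each of which fits in one byte.
const utf8Bytes = [...Buffer.from(ch, "utf-8")];
console.log("UTF-8 byte count:", utf8Bytes.length); // 3
console.log("All bytes <= 0xFF:", utf8Bytes.every((b) => b <= 0xff)); // true
```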

Try encoding the string to a Buffer first:

const data = { a: "美味しい", b: "مزیدار" };
let originalString = JSON.stringify(data);
console.log("Original:", originalString);

// Encode string to sequence of bytes in UTF-8.
const originalStringBuffer = Buffer.from(originalString, "utf-8");

let encodedRaw = base122.encode(originalStringBuffer);
let encodedString = Buffer.from(encodedRaw, "utf-8").toString();
console.log("Encoded:", encodedString);

let decodedRaw = base122.decode(encodedString);
let decodedString = Buffer.from(decodedRaw).toString("utf-8");
console.log("Decoded:", decodedString);

Outputs:

Original: {"a":"美味しい","b":"مزیدار"}
Encoded: =HL↕◄hEg_#יˏG☺Kxp↑XGΖf♂XY6qME?1'l,$'h
Decoded: {"a":"美味しい","b":"مزیدار"}

I think an improvement would be to have encode throw an exception when given a string containing code points that occupy more than one byte. This is proposed in https://github.com/kevinAlbs/Base122/pull/15
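Until then, a check along these lines could be run on the caller's side before encoding. This is only a sketch of the idea, not the code from the pull request; assertSingleByteString is a hypothetical helper name and is not part of Base122:

```javascript
// Hypothetical helper (not part of Base122): throws if any UTF-16 code unit
// in the string is above 0xFF, because such a unit cannot fit in one byte.
function assertSingleByteString(input) {
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit > 0xff) {
      throw new TypeError(
        `Code unit ${unit} at index ${i} does not fit in one byte; ` +
          "convert the string to a Buffer (for example as UTF-8) before encoding."
      );
    }
  }
  return input;
}

// ASCII-only input passes through unchanged.
console.log(assertSingleByteString('{"a":"ok"}'));

// Input with wider characters is rejected instead of being silently corrupted.
try {
  assertSingleByteString('{"a":"美味しい"}');
} catch (err) {
  console.log("Rejected:", err.message);
}
```

Failing loudly here points the caller toward the Buffer-based approach shown above, rather than producing output that cannot be decoded back.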

kevinAlbs, Oct 15 '23 01:10

The key info here was JS storing strings as UTF-16. I don't know why I thought JS uses UTF-8 for strings by default.

devqazi, Oct 15 '23 10:10

https://github.com/kevinAlbs/Base122/pull/15 is merged. Closing.

kevinAlbs, Oct 17 '23 01:10