Base122
<?> characters
Hello Kevin,
I read your blog post on this and am excited. However, I notice that base122 produces characters that are not displayed in my browser. Perhaps you (or anybody) could add a layer of post-processing that converts base122 output to "readable" utf-8?
By that I mean one could create a mapping from the 1- and 2-byte sequences of base122 to 1-, 2-, 3-, and 4-byte sequences of a more "readable" UTF-8 encoding. We could determine which sequences are least readable, and replace each such 1-byte or 2-byte segment with a "readable" 3-byte segment.
Hi, thanks for taking an interest in base122! That's correct: some characters are expected to be unrenderable by a browser, e.g. ASCII control characters.
Creating a post-processing layer would make debugging the output of base122 easier. However, that post-processing layer could be converting base122 to base64, right?
This might be related: I tried this library with the minimal example below, and the encoding does not seem to be reversible.
Example Code
const data = { a: "美味しい", b: "مزیدار" };
let originalString = JSON.stringify(data);
console.log("Original:", originalString);
let encodedRaw = base122.encode(originalString);
let encodedString = Buffer.from(encodedRaw, "utf-8").toString();
console.log("Encoded:", encodedString);
let decodedRaw = base122.decode(encodedString);
let decodedString = Buffer.from(decodedRaw).toString("utf-8");
console.log("Decoded:", decodedString);
Output
Original: {"a":"美味しい","b":"مزیدار"}
Encoded: =HL↕◄hE9UhB◄0Db◄D$)K↑/‼L$'h
Decoded: {"a":"�sWD","b":"E2�/'1"}
I am pretty sure it's me using it wrong, but I can't spot the mistake for the life of me.
@devqazi Thank you for the report.
encode expects a String argument that meets the same condition as btoa: a string in which the code point of each character occupies only one byte.
A JavaScript String uses UTF-16, and the code points in your input string exceed what one byte can store.
Example:
// JavaScript represents a string as a sequence of UTF-16 code units.
let originalString = "美";
console.log(`originalString.codePointAt(0)=${originalString.codePointAt(0)}`);
// Prints: originalString.codePointAt(0)=32654
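As a rough illustration of the gap between UTF-16 code units and UTF-8 bytes, the sketch below uses the standard TextEncoder global (available in browsers and recent Node.js versions) to show that the same character becomes three UTF-8 bytes, each of which fits in one byte. It does not call base122 at all; it only shows what the input looks like once it has been converted to bytes.

```javascript
// Compare the UTF-16 code unit of a character with its UTF-8 byte sequence.
const ch = "美";

// The single UTF-16 code unit is 0x7F8E, far larger than one byte can hold.
const codeUnit = ch.charCodeAt(0);

// TextEncoder always produces UTF-8; the result is a Uint8Array.
const utf8Bytes = Array.from(new TextEncoder().encode(ch));

console.log("UTF-16 code unit:", codeUnit);   // 32654
console.log("UTF-8 bytes:", utf8Bytes);       // [ 231, 190, 142 ]
console.log("Every byte fits in 0..255:", utf8Bytes.every((b) => b <= 0xff));
```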
Try encoding the string to a Buffer first:
const data = { a: "美味しい", b: "مزیدار" };
let originalString = JSON.stringify(data);
console.log("Original:", originalString);
// Encode string to sequence of bytes in UTF-8.
const originalStringBuffer = Buffer.from(originalString, "utf-8");
let encodedRaw = base122.encode(originalStringBuffer);
let encodedString = Buffer.from(encodedRaw, "utf-8").toString();
console.log("Encoded:", encodedString);
let decodedRaw = base122.decode(encodedString);
let decodedString = Buffer.from(decodedRaw).toString("utf-8");
console.log("Decoded:", decodedString);
Outputs:
Original: {"a":"美味しい","b":"مزیدار"}
Encoded: =HL↕◄hEg_#יˏG☺Kxp↑XGΖf♂XY6qME?1'l,$'h
Decoded: {"a":"美味しい","b":"مزیدار"}
I think an improvement would be to have encode throw an exception if given a string with code points that occupy more than one byte. This is proposed in https://github.com/kevinAlbs/Base122/pull/15
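A guard of this kind could look roughly like the sketch below. This is not the code from the pull request: assertSingleByteString is a hypothetical helper name, and the check simply rejects any UTF-16 code unit above 255 before the string would be handed to encode.

```javascript
// Hypothetical helper (not part of base122): reject strings whose
// characters cannot each be stored in a single byte.
function assertSingleByteString(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit > 0xff) {
      throw new RangeError(
        `Character at index ${i} has code unit ${unit}, which exceeds one byte`
      );
    }
  }
  return str;
}

// Usage: plain ASCII passes, while the JSON from the report is rejected.
assertSingleByteString("hello");

let rejected = false;
try {
  assertSingleByteString(JSON.stringify({ a: "美味しい" }));
} catch (err) {
  rejected = err instanceof RangeError;
}
console.log("Rejected multi-byte input:", rejected);
```

Failing loudly at this point would have surfaced the problem at the encode call, instead of as corrupted text after decoding.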
The key info here was JS storing strings as UTF-16. I don't know why I thought JS uses UTF-8 for strings by default.
https://github.com/kevinAlbs/Base122/pull/15 is merged. Closing.