utf8.js
The README should probably mention that output only looks like UTF-8, but isn't actual UTF-8
This module encodes a string so that it looks like a UTF-8 string, which may be useful for online UTF-8 demos. But at the byte level, which is what matters for hashing and similar uses, the output is not actually UTF-8.
Take your README example with the copyright character.
// U+00A9 COPYRIGHT SIGN; see http://codepoints.net/U+00A9
utf8.encode('\xA9');
// → '\xC2\xA9'
Each \xXX sequence in JavaScript produces a standalone code point, so \xA9 is natively represented as UTF-16 in JavaScript (well, UCS-2, really), which can be seen here:
console.log(Buffer.from('\xA9', 'utf16le'))
which yields the code point U+00A9 in little-endian notation:
<Buffer a9 00>
This is how one can generate an actual UTF-8 sequence. Either of these will work (the default encoding is UTF-8):
console.log(Buffer.from('\xA9'))
console.log(Buffer.from('\xA9', 'utf8'))
Both produce UTF-8 bytes, which are good for hashing and other uses where it matters:
<Buffer c2 a9>
<Buffer c2 a9>
For example, this yields the correct MD5 hash of \xA9 represented as UTF-8, because update() performs the same transformation Buffer.from() does:
console.log(crypto.createHash('md5').update('\xA9').digest('hex'))
which prints a541ecda3d4c67f1151cad5075633423. This, on the other hand, does not produce the correct hash:
console.log(crypto.createHash('md5').update(utf8.encode('\xA9')).digest('hex'))
which actually hashes <Buffer c3 82 c2 a9> and prints 1b4c0262ce2f67450c4ecb3026ab1350.
This fooled even Microsoft, who referenced utf8 in their docs; it only works because their input is always ASCII, which makes utf8.encode() a no-op.
https://docs.microsoft.com/en-us/rest/api/eventhub/generate-sas-token#nodejs
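The no-op claim is easy to check without utf8.js at all: for ASCII-only input, every character is already a single byte below 0x80, so the UTF-8 and Latin-1 byte sequences coincide and the extra encode step changes nothing. A minimal sketch (the token string below is a placeholder, not Microsoft's actual input):

```javascript
// Hypothetical ASCII-only input, standing in for the SAS-token strings in
// the Microsoft sample.
const s = 'sample-ascii-token';

// For ASCII, one-byte-per-character (latin1) and UTF-8 produce the same
// bytes, so a char-per-byte "UTF-8 string" equals the original string:
console.log(Buffer.from(s, 'utf8').equals(Buffer.from(s, 'latin1'))); // true
```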
Do you want to propose a patch?
I came across this module because it was used in my project in the way Microsoft describes in their docs, and it didn't work for non-ASCII characters. The fix was simple in my case: just use Buffer.from(), which natively produces UTF-8. But it took me a bit to realize what was going on, and I thought a note in the README clarifying that the generated strings are intended for display purposes would save time for other people.
I wouldn't be the best person to propose a description for this, though, because I'm not familiar with the project's history and intent. If you think it's clear enough what encode/decode produce, please go ahead and close this issue. Sorry for the noise, in that case.
This module encodes a string to look like a UTF-8 string, which may be used for online UTF-8 demos, but as far as bytes are concerned, which is important for hashing, etc, the output is not actually UTF-8.
The output is UTF-8 represented as a string with one byte per character, which can be easy to misuse – as you’ve seen – but is very much a thing. It’s the input format escape and btoa expect, for example. If you ever have it in Node.js for some reason, it’s the binary encoding, e.g. Buffer.from(utf8.encode(text), 'binary').
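That recovery step can be sketched without installing utf8.js by inlining the string utf8.encode('あ') produces ('\u00e3\u0081\u0082', one character per UTF-8 byte):

```javascript
// '\u00e3\u0081\u0082' stands in for utf8.encode('あ').
const encoded = '\u00e3\u0081\u0082';

// The 'binary' encoding reads each character as one byte,
// recovering the real UTF-8 bytes:
const viaBinary = Buffer.from(encoded, 'binary');
console.log(viaBinary); // <Buffer e3 81 82>

// Identical to encoding the original string as UTF-8 directly:
console.log(viaBinary.equals(Buffer.from('あ', 'utf8'))); // true
```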
As seen in the readme:
utf8.js has been tested in at least Chrome 27-39, Firefox 3-34, Safari 4-8, Opera 10-28, IE 6-11, Node.js v0.10.0, Narwhal 0.3.2, RingoJS 0.8-0.11, PhantomJS 1.9.0, and Rhino 1.7RC4.
This package supports environments that don’t even have typed arrays.
In Node.js and modern browsers, UTF-8 encoding directly to bytes is built in as Buffer and TextEncoder.
@charmander
If you ever have it in Node.js for some reason, it’s the binary encoding, e.g. Buffer.from(utf8.encode(text), 'binary').
Note that binary is a synonym for latin1 (ISO-8859-1), so what happens in Buffer.from(utf8.encode('あ'), 'latin1') is this: the encode call yields the JS string "\u00e3\u0081\u0082", which Buffer.from() then encodes in Latin-1 as E3 81 82. This matches the sequence generated by Buffer.from('あ', 'utf8'). I can see how this works out for people where Buffer-like functionality is not available.
In new code one can also use new TextEncoder().encode('あ'), which yields a Uint8Array with UTF-8 code unit values.
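A minimal side-by-side of the two built-ins (TextEncoder is a global in Node.js 11+ and in browsers):

```javascript
// TextEncoder always encodes to UTF-8, returning a Uint8Array:
const bytes = new TextEncoder().encode('あ');
console.log(bytes); // Uint8Array(3) [ 227, 129, 130 ], i.e. e3 81 82

// Buffer is Node.js-only; it produces the same UTF-8 byte sequence:
console.log(Buffer.from('あ', 'utf8')); // <Buffer e3 81 82>
```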