utf8.js Codepoint arrays and binary strings

Open devongovett opened this issue 9 years ago • 9 comments

What would you think about a PR to replace binary strings with arrays of bytes, or Buffers/typed arrays? e.g. accept arrays as input to the decoder, and produce them from the encoder.

Also, it would be nice to be able to pass arrays of codepoints to the encoder and receive an array of codepoints from the decoder instead of strings, perhaps as an option? Sometimes I need to do additional processing at the codepoint level, and it is probably a waste of time to encode the utf8 to a ucs2 string, and then decode that again to get codepoints.
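To make the idea concrete, here is a rough sketch of a decoder that takes an array of bytes and returns an array of codepoints, with no intermediate string. The name `decodeToCodePoints` is hypothetical and not part of utf8.js; the sketch simply throws on malformed input and does not follow the Encoding Standard's error handling.

```javascript
// Hypothetical helper (not part of utf8.js): decode UTF-8 bytes directly
// into an array of code points, without building a UCS-2 string first.
function decodeToCodePoints(bytes) {
  const codePoints = [];
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i++];
    if (lead < 0x80) {
      // Single-byte (ASCII) sequence.
      codePoints.push(lead);
      continue;
    }
    let needed; // number of continuation bytes expected
    let cp;     // code point accumulated so far
    let min;    // smallest code point allowed for this length (rejects overlongs)
    if ((lead & 0xE0) === 0xC0) {
      needed = 1; cp = lead & 0x1F; min = 0x80;
    } else if ((lead & 0xF0) === 0xE0) {
      needed = 2; cp = lead & 0x0F; min = 0x800;
    } else if ((lead & 0xF8) === 0xF0) {
      needed = 3; cp = lead & 0x07; min = 0x10000;
    } else {
      throw new Error('Invalid UTF-8 lead byte at index ' + (i - 1));
    }
    for (let k = 0; k < needed; k++) {
      const next = bytes[i++];
      if (next === undefined || (next & 0xC0) !== 0x80) {
        throw new Error('Invalid or missing continuation byte');
      }
      cp = (cp << 6) | (next & 0x3F);
    }
    // Reject overlong encodings, surrogates, and values above U+10FFFF.
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      throw new Error('Invalid code point 0x' + cp.toString(16));
    }
    codePoints.push(cp);
  }
  return codePoints;
}

// "a", U+20AC, U+1F600 as UTF-8 bytes
console.log(decodeToCodePoints([0x61, 0xE2, 0x82, 0xAC, 0xF0, 0x9F, 0x98, 0x80]));
// [ 97, 8364, 128512 ]
```

Since the function only indexes into `bytes` and reads its `length`, it should accept a plain array, a `Uint8Array`, or a Node `Buffer` alike.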

Thoughts? I'm happy to write PRs for this, just wanted to get your opinion first.

Dec 29 '14 00:12 devongovett

Sounds good, but I should rewrite this project first based on the exact algorithm in the Encoding Standard (see open issues).

Dec 29 '14 16:12 mathiasbynens

Hmm, looks like there is an implementation of that in the polyfill here. The algorithm that is specified looks like it would be kinda slow though. Might want to write something different that still conforms to the spec, as they suggest, rather than using their algorithm directly.

Have you seen this? A port to JS might be worthwhile. It's small, fast, and correct.

What are the current differences between this library and the standard, in terms of behavior?

Dec 29 '14 23:12 devongovett

> What are the current differences between this library and the standard, in terms of behavior?

The only difference is https://github.com/mathiasbynens/utf8.js/issues/3.

Jan 01 '15 21:01 mathiasbynens

#3 is now fixed, so go ahead, @devongovett!

One thing that would be nice is backward compatibility with older browsers. Obviously IE6 won’t support typed arrays but it would be nice if utf8.js could fall back to byte strings (as currently used) gracefully. Thoughts?

Jan 08 '15 11:01 mathiasbynens

How about just using normal JS arrays if typed arrays aren't available? Or we could just skip the typed arrays entirely. The encoder doesn't really know how big to make the buffer ahead of time (unless we go through the string twice, once before allocating the buffer, and once after) anyway, so the easiest way to write it would be to use a normal resizable JS array internally before converting it to a typed array at the end. I'm not sure how much of a performance benefit returning typed arrays would have then. We could just always return a JS array, and if the consumer of the library wants a typed array, they can easily convert it themselves. What do you think?
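A minimal sketch of that approach, assuming a modern engine (`for...of` over strings and `codePointAt`): the encoder pushes bytes onto a plain resizable array, and a separate step converts to a typed array only when one is available. The names `encodeToArray` and `toByteContainer` are hypothetical, not utf8.js API.

```javascript
// Hypothetical encoder (not utf8.js API): collect UTF-8 bytes in a plain,
// resizable JS array so the output size need not be known up front.
function encodeToArray(string) {
  const bytes = [];
  for (const ch of string) {            // iterates by code point, not code unit
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      bytes.push(cp);
    } else if (cp < 0x800) {
      bytes.push(0xC0 | (cp >> 6), 0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      if (cp >= 0xD800 && cp <= 0xDFFF) {
        throw new Error('Lone surrogate U+' + cp.toString(16).toUpperCase());
      }
      bytes.push(0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else {
      bytes.push(
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)
      );
    }
  }
  return bytes;
}

// Optional conversion step, left to the consumer: use a typed array where
// the environment has one, otherwise hand back the plain array unchanged.
function toByteContainer(bytes) {
  return typeof Uint8Array !== 'undefined' ? new Uint8Array(bytes) : bytes;
}

const plain = encodeToArray('a\u20AC');   // [0x61, 0xE2, 0x82, 0xAC]
const typed = toByteContainer(plain);
```

Keeping the conversion outside the encoder means the encoder itself has no dependency on typed arrays, which is the graceful fallback discussed above.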

Jan 08 '15 15:01 devongovett

Sounds good to me.

Jan 08 '15 15:01 mathiasbynens

What is the status of this? I have a byte array I received off the wire and I would like to be able to just pass it directly to this function without having to make a copy that turns each byte into an escaped hex value in a string.
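Until arrays are accepted directly, one possible workaround is to turn the bytes into a byte string first and pass that to the existing string-based `utf8.decode`. This is only a sketch: `bytesToByteString` is a hypothetical helper name, and the conversion still copies the data once.

```javascript
// Hypothetical helper (not part of utf8.js): convert an array or Uint8Array
// of bytes into a byte string, one character per byte.
function bytesToByteString(bytes) {
  // Convert in chunks so very large inputs do not exceed the engine's
  // limit on the number of arguments passed via apply().
  const CHUNK = 0x8000;
  let result = '';
  for (let i = 0; i < bytes.length; i += CHUNK) {
    result += String.fromCharCode.apply(null, bytes.slice(i, i + CHUNK));
  }
  return result;
}

// With utf8.js loaded as `utf8`, the result can then be decoded:
//   const text = utf8.decode(bytesToByteString(bytesFromWire));
const byteString = bytesToByteString(new Uint8Array([0xE2, 0x82, 0xAC]));
// byteString === '\xE2\x82\xAC'
```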

Feb 22 '17 23:02 MicahZoltu

Alright. I need this. So I took a stab at implementing it: https://github.com/mathiasbynens/utf8.js/pull/28

May 28 '17 21:05 samal-rasmussen

I am wondering if this could be an efficient way to store binary data as UTF-8 strings, in places where UTF-8 is allowed but binary is not.

So given a bunch of binary data, convert it to a valid UTF-8 string by escaping invalid sequences and adding padding + padcount at the end. If the binary data happens to be a valid UTF-8 string, it would be stored with 1 byte overhead (padcount), and if the binary data is FEFFFEFF... I suppose it would escape every byte :)

Sort of idle musing, I suppose that any space savings are dwarfed by the CPU overhead.

Jun 19 '17 12:06 wmertens
