iconv-lite icon indicating copy to clipboard operation
iconv-lite copied to clipboard

Support using Uint8Array as the "bytes" data type (instead of Buffer)

Open ashtuchkin opened this issue 4 years ago β€’ 21 comments

This will be helpful with browser usage (no need to ship Buffer to the browser) and would make it possible to support WHATWG Encoding interface.

  • [x] Introduce a concept of "Backend", with platform-specific functions. Node backend will refer to Buffer, Web backend would use Uint8Array-s.
  • [x] Add testing infrastructure to test encodings in all backends and in browsers.
  • [ ] Convert all encodings to use backends.

ashtuchkin avatar May 31 '20 05:05 ashtuchkin

This would be a requirement for VSCode to ship iconv-lite as part of the web experience πŸ‘

bpasero avatar Jun 17 '20 12:06 bpasero

I've made an attempt for step 1 in the linked PR. If it is not at all what was expected, feel free to close it :)

gyzerok avatar Jun 18 '20 03:06 gyzerok

After another look at the codebase, I think we'd need something more involved than what I described above. Current usages of Buffer class/instances are the following:

Migratable:

  • Buffer.from([bytes]) -> Uint8Array.from([bytes])
  • Buffer.alloc(n) -> new Uint8Array(n)
  • Buffer.alloc(n, default) -> new Uint8Array(n); buf.fill(default)
  • Buffer.concat([buf, buf]) -> helper function creating a new buf and copy.
  • buf[i] -> arr[i]
  • buf.slice(a, b) -> arr.slice(a, b)

Tricky:

  • Buffer.from(str, encoding), where encoding is 'utf8', cesu, ucs2, 'hex'. base64.
  • buf.toString(encoding) // same encodings as above
  • StringDecoder class that is Node.js-specific.

I'll need to think more about it, but for now it looks like just replacing Buffer with Uint8Array would not help (specifically buf.toString(encoding) would be tricky). We probably need to create a different class that would provide hooks for all the functionality above and have a Buffer-based implementation and a Uint8Array-based one. We'll also likely need to involve TextDecoder/TextEncoder classes to support "native" encodings.

ashtuchkin avatar Jun 18 '20 10:06 ashtuchkin

@ashtuchkin if it helps, feel free to copy anything from our VSBuffer helper that is meant as a drop-in replacement for Buffer to work in node.js and browser environments. The idea being that it will actually use node.js Buffer if available and otherwise fallback to Uint8Array: https://github.com/microsoft/vscode/blob/master/src/vs/base/common/buffer.ts

bpasero avatar Jun 18 '20 12:06 bpasero

@ashtuchkin which versions of browsers/node are you aiming to support in the library?

gyzerok avatar Jun 19 '20 02:06 gyzerok

Node versions 0.10+ Browsers - I don't have a good sense for now. I'll probably target the ones that have native Uint8Array-s. TextEncoder/TextDecoder-s can probably be polyfilled as needed.

ashtuchkin avatar Jun 19 '20 06:06 ashtuchkin

Hello @ashtuchkin!

Big thanks to you for finishing a lot of work already to make this migration possible! πŸŽ‰

I would like to help you moving forward with this. But before doing any PRs it would be great to get your feedback on the following plan.

The plan

Step 1

Change tests so that they don't expect Buffer as an output => we can gradually migrate iconv-lite to use Uint8Array exclusively.

Here is an example of what it would mean:

-assert.strictEqual(iconv.encode(testStringBase64, "base64").toString("binary"), testString);
+assert.strictEqual(Buffer.from(iconv.encode(testStringBase64, "base64")).toString("binary"), testString);

Step 2

As you suggested, create our own light module with helpers to do things like buffer.toString(encoding) and Buffer.from(str, encoding) and migrate everything to use it instead of Buffer.

Initially I suggest we just draft the API and internally still use Buffer. This will make it a single place to be modified when doing actual migration away from Buffer.

So initial implementation could be something like

function toString(buffer, encoding) {
  if (!Buffer.isBuffer(buffer)) {
    buffer = Buffer.from(buffer);
  }
  return buffer.toString(encoding);
}

function toBuffer(str, encoding) {
  return Buffer.from(str, encoding);
}

Step 3

Node has a pure JavaScript implementation for StringDecoder with a single exception - it uses Buffer. We can now copy it and change Buffer usage to our own helpers.

Step 4

Looks like we are all set now and the last thing we need to do is to implement our helpers for the Buffer-less environments.

Step 5

Try to change tests so that the same suite runs both in Node and in the browser. The only tricky part here I see is your usage of iconv, which we won't be able to run in the browser. Maybe there is a way to pregenerate fixtures with it before tests, so we can simple use them?

First steps

I think that steps 1 and 2 could be actually done in parallel in a way that 1 module is selected and both internal Buffer usages and tests are changed for it. And personally I would prefer this way of doing things, because: a) it feels more fun to me b) we can parallelize work on different codecs if you are up for it.

If all sounds good to you I suggest that I make my first PR doing steps 1 and 2 for SBCS codec.

Other

Also given your target versions for support I think we can freely get rid of safer-buffer now.

gyzerok avatar Jul 11 '20 02:07 gyzerok

Thank you for the kind words! and I'm happy you'd like to move forward with this, lets do it :)

The plan overall looks reasonable. Here's a couple of thoughts I had for this migration:

First, let's clearly define a success criteria for this project. For me it's the following: 1. Make iconv-lite work in the browser without Buffer shims (use Uint8Array as the 'binary data' type). 2. Make all tests work in the browser and add a Travis CI integration for it. 3. No regression wrt Node.js environments. Is it what you're also thinking? Did I miss anything?

Second, to make webpack/browserify not include Buffer shim, we need to be careful about dependencies and dependency injections. I see it something like this:

// lib/node-index.js   // node.js entrypoint; set as "main" field in package.json
var nodeBufferAPI = require("./node-api")
module.exports = require("./index-common.js")(nodeBufferAPI)

// lib/web-index.js   // browser entrypoint; set as "browser" field in package.json
var webBufferAPI = require("./web-api")
module.exports = require("./index-common.js")(webBufferAPI)

// lib/index-common.js 
module.exports = function(bufferAPI) {
    // initialize iconv-lite like it's happening today
    return iconv;
}

Third, what's this bufferAPI? In the code we have the following cases:

  1. decode input (bytes) processing - this already supports both Buffer and Uint8Array after our recent commits.
  2. decode output (string) creation - currently this mostly works by allocating a Buffer/Uint8Array, filling it, then slice() it and decode as UCS2 (specifically buf.toString('ucs2'). So we need alloc_bytes and decode_ucs2 operations.
  3. encode inputs (strings) processing - this is currently handled mostly by converting strings to buffers in UCS2 encoding and then working on them (specifically Buffer.from(str, 'ucs2'). To support that we need a encode_ucs2 operation that would convert a string to a Buffer/Uint8Array. We will then need to make sure the codecs are not using any Buffer-specific stuff on this binary data.
  4. encode output (bytes) creation - here we just need alloc_bytes operation that would return Buffer/Uint8Array of corresponding size. The codecs will then fill it byte-by-byte, .slice() it at the end and return. We'll need to make sure there's no Buffer-specific operations.

We also need a "concat" operation on the Buffer/Uint8Array type in several places.

So this leaves us with the following interface (I think this is the simplest we can do):

// Pseudocode
Bytes = Buffer | Uint8Array   

interface BufferAPI {
    alloc_bytes(size: int): Bytes
    concat_bytes(bufs: Bytes[]): Bytes
    encode_ucs2(str: string): Bytes
    decode_ucs2(b: Bytes): string
}

Now this will probably be enough for iconv-lite codecs, but we also need to think about internal codecs that use Buffer encodings and StringDecoder. They are:

  • utf8 - this needs to be very fast. We can either reimplement it (see CESU codec for similar code), or can introduce encode_utf8 and decode_utf8 to the BufferAPI above. In the browser they can be implemented with TextEncoder/TextDecoder.
  • utf16 - this is handled by encode_ucs2 and decode_ucs2.
  • binary - this is just alias to ASCII
  • base64 and hex - these I think we could remove, given #231. Unfortunately 'utf7' requires 'base64', so we'll have to create a js-only implementation or use an external module. Alternatively, we can add encode_base64 and decode_base64 to BufferAPI, which is a bit sad. For browsers we can use atob and btoa there.

The only hard part here would be streaming utf8, but we can use StringDecoder implementation like you proposed.

As for the tests, I think it would make sense to use the same BufferAPI interface to create corresponding data, convert and check it. A lot of tests use 'hex' encoding to compare bytes, so it might make sense to create test-only helpers to convert between hex and bytes.

One other direction I'm thinking is that (probably after this project), we can make "string" type also switchable to something else (e.g. see #242), plus make string processing more efficient by using Uint16Array-s. This would require more changes to BufferAPI (add RawChars = Uint16Array, alloc_raw_chars(), concat_raw_chars() and make encode_ucs2/decode_ucs2 work with RawChars instead of Bytes).

Does it make any sense? What do you think?

ashtuchkin avatar Jul 11 '20 08:07 ashtuchkin

Overall your plan sounds very solid. I think we can go with it and adjust along the way if need be.

Ideally I would chunk it into smaller pieces which could be done independently. At least for me it makes it easier to move forward when I can complete something within 2-4 hours timeframe when I have some free time.

Would it be possible for you to draft this smaller pieces? At least some initial ones. For example I can look at the implementation of the BufferAPI you've proposed in the browser.

First, let's clearly define a success criteria for this project.

Your criteria sounds exactly right, let's go with it πŸ‘

Second, to make webpack/browserify not include Buffer shim, we need to be careful about dependencies and dependency injections.

Honestly I don't know how well browser entry point is supported by the tooling. In any case consumers will be able to import the version they need directly.

We also need a "concat" operation on the Buffer/Uint8Array type in several places

Hm, I've thought you removed Buffer.concat usage all together.

The only hard part here would be streaming utf8, but we can use StringDecoder implementation like you proposed.

Actually I completely forgot to mention that TextDecoder supports streaming, so technically we might not even need StringDecoder.

Also I've discovered that Node has support for TextDecoder and TextEncoder APIs from version 8.3.

This in practice means that we get utf8 encode/decode for free in both environments. Of course good for you to double-check since I am no pro in encoding business :) Is

Also TextDecoder in both environments does seem to support a lot of other encodings. I am wondering if it can be used to our benefit. I would at least hypothesize that it is implemented in C and thus can give performance improvements over JavaScript implementation. Anyway this a bit off-topic to the current goal.

gyzerok avatar Jul 12 '20 01:07 gyzerok

Sounds good. I'll create the foundation.

Alexander Shtuchkin

On Sat, Jul 11, 2020 at 9:16 PM Fedor Nezhivoi [email protected] wrote:

Overall your plan sounds very solid. I think we can go with it and adjust along the way if need be.

Ideally I would chunk it into smaller pieces which could be done independently. At least for me it makes it easier to move forward when I can complete something within 2-4 hours timeframe when I have some free time.

Would it be possible for you to draft this smaller pieces? At least some initial ones. For example I can look at the implementation of the BufferAPI you've proposed in the browser.

First, let's clearly define a success criteria for this project.

Your criteria sounds exactly right, let's go with it πŸ‘

Second, to make webpack/browserify not include Buffer shim, we need to be careful about dependencies and dependency injections.

Honestly I don't know how well browser entry point is supported by the tooling. In any case consumers will be able to import the version they need directly.

We also need a "concat" operation on the Buffer/Uint8Array type in several places

Hm, I've thought you removed Buffer.concat usage all together.

The only hard part here would be streaming utf8, but we can use StringDecoder implementation like you proposed.

Actually I completely forgot to mention that TextDecoder supports streaming, so technically we might not even need StringDecoder.

Also I've discovered that Node has support for TextDecoder https://nodejs.org/api/util.html#util_class_util_textdecoder and TextEncoder https://nodejs.org/api/util.html#util_class_util_textencoder APIs from version 8.3.

This in practice means that we get utf8 encode/decode for free in both environments. Of course good for you to double-check since I am no pro in encoding business :) Is

Also TextDecoder in both environments does seem to support a lot of other encodings. I am wondering if it can be used to our benefit. I would at least hypothesize that it is implemented in C and thus can give performance improvements over JavaScript implementation. Anyway this a bit off-topic to the current goal.

β€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ashtuchkin/iconv-lite/issues/235#issuecomment-657158328, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEZKHMTPWQCPKM2HDKEYDLR3EFGFANCNFSM4NPAMD5Q .

ashtuchkin avatar Jul 13 '20 00:07 ashtuchkin

@gyzerok I've created "backends" concept, see e5678496c0637df5591a1b097add9c6770854d3e. This will enable you to help with converting other codecs.

ashtuchkin avatar Jul 14 '20 16:07 ashtuchkin

Awesome! Love the ES6 vibe πŸ˜„

gyzerok avatar Jul 14 '20 21:07 gyzerok

It's been a long time coming :)

Alexander Shtuchkin

On Tue, Jul 14, 2020 at 5:47 PM Fedor Nezhivoi [email protected] wrote:

Awesome! Love the ES6 vibe πŸ˜„

β€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ashtuchkin/iconv-lite/issues/235#issuecomment-658430269, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEZKHMZGCKFXXJCMCU3S4DR3TG7BANCNFSM4NPAMD5Q .

ashtuchkin avatar Jul 14 '20 22:07 ashtuchkin

I am thinking about picking up sbcs codec. It looks relatively simple to start with.

gyzerok avatar Jul 15 '20 07:07 gyzerok

Ok, I am going to try mbcs now. Will see if I can figure out whats up and how to migrate it and report here if I fail :)

gyzerok avatar Jul 20 '20 23:07 gyzerok

What about the internal codec? Do you have any specific plans for it's conversion?

I can give it a shot. Just want to make sure I am going to go the right direction :)

gyzerok avatar Jul 27 '20 23:07 gyzerok

For internal codec I wanted to drop hex and base64 encodings; utf-16 is implemented, so the only thing left is utf-8. We should probably add custom functions to the backend to use perf-optimized algos from Buffer and TextEncoder/Decoder. Feel free to give it a shot. I haven't had free time this week, sorry.

ashtuchkin avatar Aug 03 '20 05:08 ashtuchkin

I am now in the middle of traveling, so haven’t had a chance to continue. And my pause is going to last for some more weeks. Will write here once I’ll be available again.

gyzerok avatar Aug 16 '20 05:08 gyzerok

Why restrict to Buffer and Uint8Array? Would using Array<0 | 1 | 2 | ... | 255> cause some problems? The solution below works as a nice abstraction (when you do not have any restrictions on the stored values)

https://github.com/gliese1337/fast-myers-diff/blob/a2da289e88e266d55973f2897b91d19ac00c75fd/src/index.ts#L1

Bounds checking / type checking for each element of the array will slow down things considerably. If you have an array of different type, conversion to a UInt8Array before passing to iconv-lite should be simple enough.

On Thu, Jun 17, 2021, 19:05 Notas Hellout @.***> wrote:

Why restrict to Buffer and Uint8Array? Would using Array<0 | 1 | 2 | ... | 255> cause some problems? The solution below works as a nice abstraction (when you do not have any restrictions on the stored values)

https://github.com/gliese1337/fast-myers-diff/blob/a2da289e88e266d55973f2897b91d19ac00c75fd/src/index.ts#L1

β€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ashtuchkin/iconv-lite/issues/235#issuecomment-863615962, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEZKHPJHASRLZGRUYJKHGDTTJ5SNANCNFSM4NPAMD5Q .

ashtuchkin avatar Jun 17 '21 23:06 ashtuchkin

For convenient: https://www.npmjs.com/package/to-uint8array

jimmywarting avatar Aug 06 '21 08:08 jimmywarting