iconv-lite
Support using Uint8Array as the "bytes" data type (instead of Buffer)
This will be helpful for browser usage (no need to ship Buffer to the browser) and would make it possible to support the WHATWG Encoding interface.
- [x] Introduce a concept of "Backend", with platform-specific functions. Node backend will refer to Buffer, Web backend would use Uint8Array-s.
- [x] Add testing infrastructure to test encodings in all backends and in browsers.
- [ ] Convert all encodings to use backends.
This would be a requirement for VSCode to ship iconv-lite as part of the web experience.
I've made an attempt at step 1 in the linked PR. If it is not at all what was expected, feel free to close it :)
After another look at the codebase, I think we'd need something more involved than what I described above. Current usages of Buffer class/instances are the following:
Migratable:
- Buffer.from([bytes]) -> Uint8Array.from([bytes])
- Buffer.alloc(n) -> new Uint8Array(n)
- Buffer.alloc(n, default) -> new Uint8Array(n); buf.fill(default)
- Buffer.concat([buf, buf]) -> a helper function that creates a new buffer and copies into it.
- buf[i] -> arr[i]
- buf.slice(a, b) -> arr.slice(a, b)
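Of the "migratable" items, only the Buffer.concat case needs a new helper; a minimal sketch might look like this (concatBytes is a hypothetical name, not existing iconv-lite API):

```javascript
// Hypothetical helper replacing Buffer.concat for plain Uint8Arrays.
function concatBytes(chunks) {
    // Sum the lengths, allocate once, then copy each chunk in.
    let total = 0;
    for (const chunk of chunks) total += chunk.length;
    const result = new Uint8Array(total);
    let offset = 0;
    for (const chunk of chunks) {
        result.set(chunk, offset);
        offset += chunk.length;
    }
    return result;
}
```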
Tricky:
- Buffer.from(str, encoding), where encoding is 'utf8', 'cesu', 'ucs2', 'hex', 'base64'.
- buf.toString(encoding) // same encodings as above
- StringDecoder class that is Node.js-specific.
I'll need to think more about it, but for now it looks like just replacing Buffer with Uint8Array would not help (specifically buf.toString(encoding) would be tricky). We probably need to create a different class that would provide hooks for all the functionality above and have a Buffer-based implementation and a Uint8Array-based one. We'll also likely need to involve TextDecoder/TextEncoder classes to support "native" encodings.
@ashtuchkin if it helps, feel free to copy anything from our VSBuffer helper that is meant as a drop-in replacement for Buffer to work in node.js and browser environments. The idea being that it will actually use the node.js Buffer if available and otherwise fall back to Uint8Array: https://github.com/microsoft/vscode/blob/master/src/vs/base/common/buffer.ts
@ashtuchkin which versions of browsers/node are you aiming to support in the library?
Node versions 0.10+. Browsers: I don't have a good sense for now. I'll probably target the ones that have native Uint8Array-s. TextEncoder/TextDecoder-s can probably be polyfilled as needed.
Hello @ashtuchkin!
Big thanks to you for finishing a lot of the work already to make this migration possible!
I would like to help you move forward with this. But before doing any PRs, it would be great to get your feedback on the following plan.
The plan
Step 1
Change tests so that they don't expect Buffer as an output => we can gradually migrate iconv-lite to use Uint8Array exclusively.
Here is an example of what it would mean:
-assert.strictEqual(iconv.encode(testStringBase64, "base64").toString("binary"), testString);
+assert.strictEqual(Buffer.from(iconv.encode(testStringBase64, "base64")).toString("binary"), testString);
Step 2
As you suggested, create our own light module with helpers to do things like buffer.toString(encoding) and Buffer.from(str, encoding), and migrate everything to use it instead of Buffer.
Initially I suggest we just draft the API and internally still use Buffer. This will make it a single place to be modified when doing the actual migration away from Buffer.
So the initial implementation could be something like:
function toString(buffer, encoding) {
    if (!Buffer.isBuffer(buffer)) {
        buffer = Buffer.from(buffer);
    }
    return buffer.toString(encoding);
}

function toBuffer(str, encoding) {
    return Buffer.from(str, encoding);
}
Step 3
Node has a pure JavaScript implementation of StringDecoder with a single exception: it uses Buffer. We can now copy it and change the Buffer usage to our own helpers.
Step 4
Looks like we are all set now, and the last thing we need to do is to implement our helpers for the Buffer-less environments.
Step 5
Try to change the tests so that the same suite runs both in Node and in the browser. The only tricky part I see here is your usage of iconv, which we won't be able to run in the browser. Maybe there is a way to pregenerate fixtures with it before the tests, so we can simply use them?
First steps
I think that steps 1 and 2 could actually be done in parallel, in a way that one module is selected and both the internal Buffer usages and the tests are changed for it. And personally I would prefer this way of doing things, because: a) it feels more fun to me; b) we can parallelize work on different codecs if you are up for it.
If this all sounds good to you, I suggest that I make my first PR doing steps 1 and 2 for the SBCS codec.
Other
Also, given your target versions for support, I think we can freely get rid of safer-buffer now.
Thank you for the kind words! I'm happy you'd like to move forward with this, let's do it :)
The plan overall looks reasonable. Here are a couple of thoughts I had for this migration:
First, let's clearly define the success criteria for this project. For me it's the following: 1. Make iconv-lite work in the browser without Buffer shims (use Uint8Array as the 'binary data' type). 2. Make all tests work in the browser and add a Travis CI integration for it. 3. No regression wrt Node.js environments. Is this what you're also thinking? Did I miss anything?
Second, to make webpack/browserify not include the Buffer shim, we need to be careful about dependencies and dependency injection. I see it as something like this:
// lib/node-index.js // node.js entrypoint; set as "main" field in package.json
var nodeBufferAPI = require("./node-api")
module.exports = require("./index-common.js")(nodeBufferAPI)
// lib/web-index.js // browser entrypoint; set as "browser" field in package.json
var webBufferAPI = require("./web-api")
module.exports = require("./index-common.js")(webBufferAPI)
// lib/index-common.js
module.exports = function(bufferAPI) {
    // initialize iconv-lite like it's happening today
    return iconv;
}
Third, what is this bufferAPI? In the code we have the following cases:
- decode input (bytes) processing - this already supports both Buffer and Uint8Array after our recent commits.
- decode output (string) creation - currently this mostly works by allocating a Buffer/Uint8Array, filling it, then slice()-ing it and decoding as UCS2 (specifically buf.toString('ucs2')). So we need alloc_bytes and decode_ucs2 operations.
- encode input (strings) processing - this is currently handled mostly by converting strings to buffers in UCS2 encoding and then working on them (specifically Buffer.from(str, 'ucs2')). To support that we need an encode_ucs2 operation that would convert a string to a Buffer/Uint8Array. We will then need to make sure the codecs are not using any Buffer-specific stuff on this binary data.
- encode output (bytes) creation - here we just need an alloc_bytes operation that would return a Buffer/Uint8Array of the corresponding size. The codecs will then fill it byte-by-byte, .slice() it at the end and return it. We'll need to make sure there are no Buffer-specific operations.
We also need a "concat" operation on the Buffer/Uint8Array type in several places.
So this leaves us with the following interface (I think this is the simplest we can do):
// Pseudocode
Bytes = Buffer | Uint8Array

interface BufferAPI {
    alloc_bytes(size: int): Bytes
    concat_bytes(bufs: Bytes[]): Bytes
    encode_ucs2(str: string): Bytes
    decode_ucs2(b: Bytes): string
}
Now this will probably be enough for the iconv-lite codecs, but we also need to think about the internal codecs that use Buffer encodings and StringDecoder. They are:
- utf8 - this needs to be very fast. We can either reimplement it (see the CESU codec for similar code), or introduce encode_utf8 and decode_utf8 to the BufferAPI above. In the browser they can be implemented with TextEncoder/TextDecoder.
- utf16 - this is handled by encode_ucs2 and decode_ucs2.
- binary - this is just an alias to ASCII.
- base64 and hex - these I think we could remove, given #231. Unfortunately 'utf7' requires 'base64', so we'll have to create a js-only implementation or use an external module. Alternatively, we can add encode_base64 and decode_base64 to the BufferAPI, which is a bit sad. For browsers we can use atob and btoa there.
The only hard part here would be streaming utf8, but we can use the StringDecoder implementation like you proposed.
As for the tests, I think it would make sense to use the same BufferAPI interface to create corresponding data, convert and check it. A lot of tests use 'hex' encoding to compare bytes, so it might make sense to create test-only helpers to convert between hex and bytes.
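Such test-only hex helpers could be as simple as (names are illustrative):

```javascript
// Illustrative test-only helpers converting between hex strings and bytes,
// so browser tests can compare byte output without Buffer's 'hex' encoding.
function bytesToHex(bytes) {
    return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

function hexToBytes(hex) {
    const bytes = new Uint8Array(hex.length / 2);
    for (let i = 0; i < bytes.length; i++) {
        bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
    }
    return bytes;
}
```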
One other direction I'm thinking about is that (probably after this project) we can make the "string" type also switchable to something else (e.g. see #242), plus make string processing more efficient by using Uint16Array-s. This would require more changes to the BufferAPI (add RawChars = Uint16Array, alloc_raw_chars(), concat_raw_chars(), and make encode_ucs2/decode_ucs2 work with RawChars instead of Bytes).
Does it make any sense? What do you think?
Overall your plan sounds very solid. I think we can go with it and adjust along the way if need be.
Ideally I would chunk it into smaller pieces which could be done independently. At least for me it makes it easier to move forward when I can complete something within a 2-4 hour timeframe when I have some free time.
Would it be possible for you to draft these smaller pieces? At least some initial ones. For example, I can look at the implementation of the BufferAPI you've proposed in the browser.
> First, let's clearly define the success criteria for this project.
Your criteria sounds exactly right, let's go with it!
> Second, to make webpack/browserify not include the Buffer shim, we need to be careful about dependencies and dependency injection.
Honestly, I don't know how well the browser entry point is supported by the tooling. In any case, consumers will be able to import the version they need directly.
> We also need a "concat" operation on the Buffer/Uint8Array type in several places.
Hm, I thought you removed the Buffer.concat usage altogether.
> The only hard part here would be streaming utf8, but we can use the StringDecoder implementation like you proposed.
Actually, I completely forgot to mention that TextDecoder supports streaming, so technically we might not even need StringDecoder.
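For reference, a minimal demonstration of TextDecoder's streaming mode (this is the standard WHATWG Encoding API; TextDecoder is available as a global in browsers and modern Node):

```javascript
// A multi-byte UTF-8 character (the euro sign, bytes 0xE2 0x82 0xAC)
// split across two chunks, decoded with the { stream: true } option.
const decoder = new TextDecoder("utf-8");
const first = decoder.decode(Uint8Array.from([0xe2, 0x82]), { stream: true });
const second = decoder.decode(Uint8Array.from([0xac]));
// `first` is "": the incomplete sequence is buffered inside the decoder
// rather than emitted as U+FFFD; the character appears once complete.
const result = first + second;
```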
Also, I've discovered that Node has support for the TextDecoder and TextEncoder APIs from version 8.3.
This in practice means that we get utf8 encode/decode for free in both environments. Of course, it's good for you to double-check since I am no pro in the encoding business :)
Also, TextDecoder in both environments does seem to support a lot of other encodings. I am wondering if it can be used to our benefit. I would at least hypothesize that it is implemented in C and thus can give performance improvements over a JavaScript implementation. Anyway, this is a bit off-topic to the current goal.
Sounds good. I'll create the foundation.
@gyzerok I've created "backends" concept, see e5678496c0637df5591a1b097add9c6770854d3e. This will enable you to help with converting other codecs.
Awesome! Love the ES6 vibe!
It's been a long time coming :)
I am thinking about picking up the sbcs codec. It looks relatively simple to start with.
Ok, I am going to try mbcs now. I'll see if I can figure out what's up and how to migrate it, and report here if I fail :)
What about the internal codec? Do you have any specific plans for its conversion?
I can give it a shot. Just want to make sure I am going in the right direction :)
For the internal codec I wanted to drop the hex and base64 encodings; utf-16 is implemented, so the only thing left is utf-8. We should probably add custom functions to the backend to use perf-optimized algos from Buffer and TextEncoder/TextDecoder. Feel free to give it a shot. I haven't had free time this week, sorry.
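A rough sketch of what those backend utf-8 hooks might look like (hypothetical names; the idea is that the Node backend delegates to Buffer while the web backend delegates to TextEncoder/TextDecoder):

```javascript
// Hypothetical per-backend utf-8 hooks; both accept/return bytes.
const nodeUtf8 = {
    encode_utf8: (str) => Buffer.from(str, "utf8"),
    decode_utf8: (bytes) => Buffer.from(bytes).toString("utf8"),
};

const webUtf8 = {
    encode_utf8: (str) => new TextEncoder().encode(str),
    decode_utf8: (bytes) => new TextDecoder("utf-8").decode(bytes),
};
```

Since Buffer is a Uint8Array subclass and TextDecoder accepts any BufferSource, the two backends should be freely interchangeable from the codecs' point of view.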
I am now in the middle of traveling, so I haven't had a chance to continue. And my pause is going to last for some more weeks. I will write here once I'm available again.
Why restrict to Buffer and Uint8Array? Would using Array<0 | 1 | 2 | ... | 255> cause some problems? The solution below works as a nice abstraction (when you do not have any restrictions on the stored values):
https://github.com/gliese1337/fast-myers-diff/blob/a2da289e88e266d55973f2897b91d19ac00c75fd/src/index.ts#L1
Bounds checking / type checking for each element of the array will slow things down considerably. If you have an array of a different type, conversion to a Uint8Array before passing to iconv-lite should be simple enough.
For convenience: https://www.npmjs.com/package/to-uint8array