
Unable to Use Bundle in Browser Extension Due to "Input must be a byte array" Exception

Open KoriIku opened this issue 1 year ago • 1 comments

I ran into a problem while bundling this repository for use in a browser extension.

I created the bundle with the following command:

browserify index.ts -p [ tsify --noImplicitAny ] -s chardet --detect-globals false -o bundle.js

I then called it as follows, where content is a string:

chardet.detect(content)

This threw the following exception:

Input must be a byte array, e.g., Buffer or Uint8Array

The exception comes from the following check, which requires the input to be an object with a numeric length. A string fails it, and because Buffer.from is not available in the browser, I cannot convert the string to a Buffer as shown in the readme.

const isByteArray = (input) => {
    if (input == null || typeof input != 'object')
        return false;
    return isFinite(input.length) && input.length >= 0;
};

I tried removing this check, but detection then returned inaccurate results.
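
A minimal sketch of why a string fails this check, and why removing the check is unlikely to help. The isByteArray function is copied from the snippet above. The `& 0xff` masking is only an assumption about how a detector might read each element, not a statement about node-chardet's internals.

```javascript
// Copied from the snippet above.
const isByteArray = (input) => {
    if (input == null || typeof input != 'object')
        return false;
    return isFinite(input.length) && input.length >= 0;
};

const content = 'ÕâÊÇ';

// A string primitive has typeof 'string', so the check rejects it.
const stringAccepted = isByteArray(content);

// A Uint8Array is an object with a numeric length, so it passes.
const bytesAccepted = isByteArray(new Uint8Array([0xd5, 0xe2, 0xca, 0xc7]));

// Indexing a string yields one-character strings, not numbers.
const firstElement = content[0];

// Hypothetical byte read: 'Õ' coerces to NaN, which the bitwise AND turns into 0.
const maskedElement = content[0] & 0xff;

console.log(stringAccepted, bytesAccepted, typeof firstElement, maskedElement);
// prints: false true string 0
```

Any code that treats the elements as numbers would therefore see no usable byte values in a string, which is consistent with the inaccurate results.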

Here is the string I'm testing with: ÕâÊÇÒ»¸ö²âÊÔ×Ö·û´®

It is GB2312-encoded text that was mistakenly decoded as ISO8859-1. My goal is to detect the correct encoding of this string so that it can be decoded correctly.

If the encoding used is correct, then the original content of this string should be visible: "这是一个测试字符串" (This is a test string).

Any advice or suggestions on how to resolve this would be greatly appreciated. Thank you for your time.

KoriIku · Feb 05 '24 18:02

The reason strings aren't accepted as input is that the byte representation of a string depends on encoding. For example, in UTF-8, the bytes of the string ÕâÊÇÒ»¸ö²âÊÔ×Ö·û´® would be this:

C3 95 C3 A2 C3 8A C3 87 C3 92 C2 BB C2 B8 C3 B6 C2 B2 C3 A2 C3 8A C3 94 C3 97 C3 96 C2 B7 C3 BB C2 B4 C2 AE

However, in ISO8859-1, the bytes of the same string are:

D5 E2 CA C7 D2 BB B8 F6 B2 E2 CA D4 D7 D6 B7 FB B4 AE

If your input was in UTF-8, you could use a TextEncoder to get the bytes; for the special case of ISO8859-1, you could also use Uint8Array.from(text, (x) => x.codePointAt(0)) (because the ISO8859-1 charset is the same as the first 256 characters of Unicode).
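
As a rough sketch of those two conversions using only built-ins available in browsers: latin1Bytes and toHex are hypothetical helper names introduced here for illustration, and the range check is an added safeguard rather than anything the library requires.

```javascript
const text = 'ÕâÊÇÒ»¸ö²âÊÔ×Ö·û´®';

// UTF-8: TextEncoder always produces UTF-8 bytes.
const utf8Bytes = new TextEncoder().encode(text);

// ISO8859-1: each character's code point (0-255) is its byte value.
// Hypothetical helper; throws if a character cannot be represented.
function latin1Bytes(str) {
    return Uint8Array.from(str, (ch) => {
        const codePoint = ch.codePointAt(0);
        if (codePoint > 0xff) {
            throw new RangeError(`Not representable in ISO8859-1: ${ch}`);
        }
        return codePoint;
    });
}

// Hypothetical helper: format bytes as space-separated uppercase hex.
const toHex = (bytes) =>
    Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

console.log(toHex(utf8Bytes));         // 36 bytes, starting C3 95 C3 A2
console.log(toHex(latin1Bytes(text))); // 18 bytes, starting D5 E2 CA C7
```

Either result is a Uint8Array, so it passes the byte-array check and can be handed to chardet.detect or chardet.analyse.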

In the more general case, you can use a library like iconv-lite to convert between the encodings, like this:

const iconv = await import('https://esm.sh/v135/[email protected]')
const chardet = await import('https://esm.sh/v135/[email protected]')

const text = 'ÕâÊÇÒ»¸ö²âÊÔ×Ö·û´®'
const bytes = iconv.encode(text, 'ISO8859-1')
const results = chardet.analyse(bytes)

for (const { name: encoding, confidence } of results) {
    console.info(encoding, confidence, iconv.decode(bytes, encoding))
}

Output:

Shift_JIS 10 ユ簗ヌメサク簗ヤラヨキ逸ョ
Big5 10 涴岆珨跺聆彸趼睫揹
EUC-JP 10 宸頁匯倖霞編忖憲堪
EUC-KR 10 侶角寧몸꿎桿俚륜눔
GB18030 10 这是一个测试字符串
ASCII 0 ������������������
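
Once a candidate such as GB18030 has been chosen, the decoding step can probably also be done without iconv-lite in a browser extension. This is a minimal sketch that assumes the runtime's TextDecoder supports the 'gb18030' label (browsers do; Node.js needs a build with full ICU).

```javascript
// Bytes of the test string when read as ISO8859-1 (listed earlier in this thread).
const bytes = Uint8Array.from([
    0xd5, 0xe2, 0xca, 0xc7, 0xd2, 0xbb, 0xb8, 0xf6, 0xb2,
    0xe2, 0xca, 0xd4, 0xd7, 0xd6, 0xb7, 0xfb, 0xb4, 0xae,
]);

// Assumption: 'gb18030' is a supported TextDecoder label in this runtime.
const decoded = new TextDecoder('gb18030').decode(bytes);

console.log(decoded); // 这是一个测试字符串
```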

lionel-rowe · Mar 19 '24 11:03

I've added a note on strings to the README: https://github.com/runk/node-chardet/pull/103/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R74

Closing.

runk · Aug 07 '24 22:08