jschardet Detect encoding by looking for specific markers

Open bpasero opened this issue 6 years ago • 6 comments

I was wondering if jschardet would ever consider to understand specific markers within a file to get the encoding from. For example, XML can have an encoding in the header:

<?xml version="1.0" encoding="windows-1251"?>

and HTML as well:

<meta charset="..."/>

There may be other languages where this exists too.
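
For illustration only, here is a minimal sketch of what reading such markers could look like. It is not part of jschardet: the function name `sniffDeclaredEncoding`, the 1024-byte cutoff, and the shape of the return value are all assumptions made for this example. It also assumes the file starts in an ASCII-compatible encoding, so a UTF-16 file would not match.

```javascript
// Hypothetical helper, not part of jschardet: reads an encoding label that the
// file declares about itself, without judging whether the label is correct.
function sniffDeclaredEncoding(bytes) {
  // Only look at the start of the file; 1024 bytes is an arbitrary cutoff chosen here.
  const head = bytes.subarray(0, 1024);
  // Map each byte to one character so ASCII markup survives whatever the real encoding is.
  let text = "";
  for (const b of head) text += String.fromCharCode(b);

  // XML declaration, e.g. <?xml version="1.0" encoding="windows-1251"?>
  const xml = /<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']/.exec(text);
  if (xml) return { source: "xml", encoding: xml[1].toLowerCase() };

  // HTML, e.g. <meta charset="..."> (also matches charset=... inside a content attribute)
  const meta = /<meta\s[^>]*?charset\s*=\s*["']?([A-Za-z0-9._-]+)/i.exec(text);
  if (meta) return { source: "html", encoding: meta[1].toLowerCase() };

  // No marker found: the caller has to fall back to byte-based detection.
  return null;
}
```

The function takes a `Uint8Array` (a Node.js `Buffer` works too) and only reports what the file claims, which is why the result would still need to be checked against the actual bytes.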

Refs: https://github.com/microsoft/vscode/issues/36230

Nov 18 '17 08:11 bpasero

That’s a good question, and I did think about it. I ended up not doing it because it’s not uncommon for a file to declare encoding X while actually being stored in Y, which is a common source of bugs. I would prefer to leave detecting this kind of bug to the IDE, which can then inform the user about it.

I’m happy to re-evaluate this though. I noticed you’re working on VSCode, so you could provide better insight on this. My assumption was also that IDEs would prefer to know the actual byte encoding instead of relying on metadata. Otherwise it would be impossible for the IDE to know the real encoding (though I guess I could always add an option to opt out of metadata usage).

Nov 18 '17 14:11 aadsm

Yeah I brought this up because we got some reports from users asking for this feature and we use jschardet when users have enabled auto-guessing of encoding. Maybe this could be an option in jschardet that is not enabled by default.

I see the issue with the actual encoding being different from what is set in the file by the user. On the other hand, isn't the encoding always a guess that can be wrong? So maybe using the hints that are in the file is not a bad idea (at least optionally).

Nov 19 '17 07:11 bpasero

Yeah, the encoding is always a guess, but on the premise that it uses the bytes and not the metadata (I originally created this library to detect Cyrillic encodings in ID3 tags that reported wrong metadata).

I’m happy to have this feature though. I currently don’t have the time to implement it (maybe during the Christmas vacation). Do you know if anyone is interested in coding this? I can provide mentoring and guidance if needed.

Nov 19 '17 18:11 aadsm

Have a nice time!

I have a similar problem, but "from the other side": a developer uses [email protected] (he says) and from time to time (not always!) they get a "MacCyrillic instead of Win1251" error with my files, even though both the headers and the bodies are 1251. Do you have a sandbox where I can verify my files (I may well have made an error in them)?

Sincerely yours, Dmitry [email protected]

PS Merry Christmas! :-)

Dec 27 '17 15:12 OneLonelly

What do you mean a sandbox to verify your files? You can use runkit to test the library: https://npm.runkit.com/jschardet

Jan 16 '18 07:01 aadsm

jschardet does its job correctly: it detects the most likely encoding from the character codes present in the string. The bug is actually in VSCode, because it keeps asking jschardet while completely ignoring the files it's supposed to handle. XML is a standard. If the standard says that the encoding is specified in a certain way, then the encoding is specified and jschardet isn't even needed. The IDE has to use it or, at most, if it's a good IDE, tell the user:

"Hey, it says here that the file is ISO8859, but jschardet is confident that what's in the file is encoded in UTF-8. Did you by any chance save this file with a previous version of Visual Studio Code and it fucked your file up? Do you want me to fix the file for you by converting all the characters back to what they were before VSCode messed with them without you even knowing it?"

Moreover, the user might have "encoding autodetect" in VSCode set to false. In that case VSCode has to OPEN the file in the only encoding it knows, which is either the one determined by the standard (first) or the default (when the standard doesn't specify it clearly). Instead, if autodetect is inactive, VSCode has default encoding A, the OS has B and the file has C, then VSCode opens the file with encoding A, messes up the characters and saves it in A again... but you ask jschardet to fix this bug for you, even though jschardet isn't even invoked because autodetect is false.
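
The precedence argued for here (declared encoding first, then a byte-based guess only if autodetect is on, then the default) can be sketched roughly as follows. This is a hypothetical illustration, not VSCode's or jschardet's actual logic: `chooseEncoding`, `normalizeLabel` and the option names are made up for the example. The `detected` argument is assumed to have the `{ encoding, confidence }` shape that `jschardet.detect()` returns.

```javascript
// Hypothetical policy helper; all names and option shapes are assumptions for illustration.

// Compare labels loosely, so "ISO-8859-1" and "iso8859_1" count as the same label.
function normalizeLabel(label) {
  return label ? label.toLowerCase().replace(/[^a-z0-9]/g, "") : null;
}

function chooseEncoding({ declared, detected, autoDetect, defaultEncoding }) {
  // The byte-based guess is only considered when autodetect is enabled.
  const guess = autoDetect && detected ? detected.encoding : null;
  // A declaration in the file wins; otherwise the guess; otherwise the default.
  const encoding = declared || guess || defaultEncoding;
  // Flag disagreement so the editor can ask the user instead of silently re-encoding.
  const mismatch =
    Boolean(declared && guess) && normalizeLabel(declared) !== normalizeLabel(guess);
  return { encoding, mismatch };
}
```

With `declared` set to `"ISO-8859-1"` and a detected encoding of `"UTF-8"`, the sketch keeps the declared encoding and sets `mismatch` to `true`, which is the point where an editor could show a warning like the one quoted above. A real implementation would need more care, for example treating a pure-ASCII detection result as compatible with most declared encodings.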

And guess what? You don't even use "encoding detection" during "replace all", so if I want to replace a tag in an XML file (a tag that doesn't even contain characters outside [a-zA-Z]), say "tagA" with "tagB", you go through the process above of ignoring everything and damage ALL the user's files at once (because you open the whole file and save the whole file).

But when dozens of users file this as a bug report against VSCode (because it is one), you classify it as a "feature request". 🤡

Oct 12 '21 15:10 LinoBarreca
