jschardet
Character encoding auto-detection in JavaScript (a port of Python's chardet)
Related to #42. Since we no longer need browser bundles for tests, we should drop these bundles from git. We could use the `prepublishOnly` task to automate bundles build...
Detect attached file. The result will be `windows-1252` [shift-jis.txt](https://github.com/aadsm/jschardet/files/879328/shift-jis.txt) 
Guess the encoding on the attached file. It contains emojis but is a fine UTF-8 file. [strip.sh.zip](https://github.com/aadsm/jschardet/files/5558481/strip.sh.zip)
The issue I'm having is because of the degree symbol: UTF-8 \xc2\xb0 http://www.fileformat.info/info/unicode/char/b0/index.htm Below, I include the boiled-down calls. My true testing data sample includes properly formatted XML; but through...
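For reference, the byte pair mentioned in this report can be reproduced with Node's built-in `Buffer`. This is a minimal sketch that is not part of the original report and does not call jschardet; the variable names are illustrative. It shows that `\xc2\xb0` is the UTF-8 encoding of U+00B0 and that the same two bytes also decode without error as Latin-1, which is one plausible reason a statistical detector might hesitate on short inputs.

```javascript
// Degree sign U+00B0 encoded as UTF-8 (two bytes).
const degreeUtf8 = Buffer.from('\u00b0', 'utf8');
const degreeHex = degreeUtf8.toString('hex');

// The same bytes read as Latin-1: each byte maps to exactly one character,
// so the decode never fails, it just yields different text.
const degreeAsLatin1 = degreeUtf8.toString('latin1');

console.log(degreeHex);
console.log(JSON.stringify(degreeAsLatin1));
```

Because a single-byte encoding accepts any byte sequence, decoding success alone cannot distinguish it from UTF-8; only the strict UTF-8 rules can rule a candidate out.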
See https://github.com/Microsoft/vscode/issues/33720. Test case:
```
#!/bin/sh
foo() {
  echo "starting …"
}
```
The ellipsis symbol `…` makes VS Code guess cp1252. UTF-8 should have higher priority, IMO.
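One way a caller could give UTF-8 priority is to run a strict validity check before consulting any statistical detector. The sketch below assumes a Node.js runtime with the global `TextDecoder`; `isValidUtf8` is a hypothetical helper, not a jschardet function, and this is not a description of how jschardet or VS Code work internally.

```javascript
// Hypothetical helper (not part of jschardet): returns true when the bytes
// are well-formed UTF-8, using the strict ("fatal") mode of TextDecoder.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false; // decode throws on any malformed sequence
  }
}

// The test case from the report, saved as UTF-8.
const script = Buffer.from('#!/bin/sh\nfoo() {\n  echo "starting \u2026"\n}\n', 'utf8');

// The ellipsis as UTF-8 bytes, and as the single byte cp1252 uses for it.
const ellipsisHex = Buffer.from('\u2026', 'utf8').toString('hex');
const cp1252Ellipsis = Buffer.from([0x85]);

console.log(isValidUtf8(script));
console.log(ellipsisHex);
console.log(isValidUtf8(cp1252Ellipsis));
```

A file that passes this check and contains multi-byte sequences is very unlikely to be cp1252 text by accident, so a caller could reasonably accept UTF-8 at that point and fall back to detection only when the check fails.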
https://github.com/atom/encoding-selector/issues/65
### Steps to Reproduce
https://github.com/malice-plugins/yara/blob/17a4fc946febe8b002e285f591bcb21b92a99e9e/rules/userdb_panda.yar
- Open in Atom
- Select "Auto Detect" encoding

**Expected behavior:** Detects the encoding of the file as GB18030. `iconv -f GB18030 -t UTF-8...
* file: [Untitled-1.txt](https://github.com/aadsm/jschardet/files/906100/Untitled-1.txt) * output with debug: ``` EUC-TW prober hit error at byte 0 windows-1251 confidence = 0, below negative shortcut threshhold 0.05 UTF-8 not active SHIFT_JIS confidence =...
Detect attached file. The result will be `windows-1252` [iso-8859-1.txt](https://github.com/aadsm/jschardet/files/879334/iso-8859-1.txt) 
Every message that uses the character `ç` next to another Unicode character comes back with strange characters. **Using encoding: UTF-8** `çã` shows as `згo`; `çõ` shows as `уш`. This can only be...
[MY EUC-KR DATA](https://github.com/aadsm/jschardet/files/1466005/1.txt) This file is encoded in `EUC-KR`, but it is detected as `ISO-8859-2`. However, `chardet`, the Python library, detects it correctly as `EUC-KR`.