notepad4
Change Encoding-Detector to UCHARDET
Mozilla's (u)chardet would generate better results, so I would like to see a switch to this encoding detector.
Uchardet is an encoding detector library, which takes a sequence of bytes in an unknown character encoding without any additional information, and attempts to determine the encoding of the text. Returned encoding names are iconv-compatible. Uchardet started as a C language binding of the original C++ implementation of the universal charset detection library by Mozilla. It can now detect more charsets, and more reliably than the original implementation. https://www.freedesktop.org/wiki/Software/uchardet/ https://github.com/PyYoshi/uchardet
I experienced the excellent recognition rate of UCHARDET in TextPro. I submitted this to Notepad3 in March 2019 and it was accepted soon after: https://github.com/rizonesoft/Notepad3/issues/973 "With its training ability and its detection parameter in "%", UCHARDET is really superior!"
Thank you!
I'm not interested in an encoding detection library:
- UTF-8 is getting more and more popular, and is the default encoding used by Notepad2 and many other applications (including Windows Notepad on Win10). We already use very fast UTF-8 validation code (from https://github.com/zwegner/faster-utf8-validator and https://bjoern.hoehrmann.de/utf-8/decoder/dfa/).
- UTF-16 and UTF-32 (not supported, rarely used as a storage format/encoding) files must begin with a BOM, so no detection is needed.
- For other legacy encodings, we currently default to the Windows ANSI code page (GetACP()), which is most likely the case.
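The three rules above can be sketched roughly as follows. This is only an illustration, not Notepad2's actual code: the `Encoding` enum and the `is_valid_utf8` and `guess_encoding` names are hypothetical, and the validator is a plain byte-by-byte check rather than the fast implementations linked above. The sketch reports UTF-32 BOMs for completeness, although UTF-32 is not supported by the editor.

```c
#include <stddef.h>

/* Hypothetical result labels for this sketch (not Notepad2's real types). */
typedef enum {
    ENC_UTF8_BOM,
    ENC_UTF16_LE,
    ENC_UTF16_BE,
    ENC_UTF32_LE,
    ENC_UTF32_BE,
    ENC_UTF8,
    ENC_LEGACY_ANSI   /* on Windows the caller would map this to GetACP() */
} Encoding;

/* Simple, unoptimized UTF-8 validator: rejects stray or missing
   continuation bytes, overlong forms, surrogates, and values above U+10FFFF. */
static int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t need;              /* number of continuation bytes expected */
        unsigned long cp, min;    /* decoded value and smallest legal value */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;            /* stray continuation byte or invalid lead */

        if (n - i - 1 < need) return 0;   /* sequence truncated at end of input */
        for (size_t k = 1; k <= need; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}

/* BOM first, then UTF-8 validation, then fall back to the legacy code page. */
static Encoding guess_encoding(const unsigned char *s, size_t n)
{
    /* UTF-32 LE is tested before UTF-16 LE because FF FE 00 00 also
       begins with the UTF-16 LE BOM (FF FE). */
    if (n >= 4 && s[0] == 0xFF && s[1] == 0xFE && s[2] == 0x00 && s[3] == 0x00)
        return ENC_UTF32_LE;
    if (n >= 4 && s[0] == 0x00 && s[1] == 0x00 && s[2] == 0xFE && s[3] == 0xFF)
        return ENC_UTF32_BE;
    if (n >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF)
        return ENC_UTF8_BOM;
    if (n >= 2 && s[0] == 0xFF && s[1] == 0xFE)
        return ENC_UTF16_LE;
    if (n >= 2 && s[0] == 0xFE && s[1] == 0xFF)
        return ENC_UTF16_BE;
    if (is_valid_utf8(s, n))
        return ENC_UTF8;          /* pure ASCII also lands here */
    return ENC_LEGACY_ANSI;
}
```

With this sketch, the bytes of "café" encoded as UTF-8 (`63 61 66 C3 A9`) are classified as `ENC_UTF8`, while the same word in a single-byte legacy encoding (`63 61 66 E9`) fails validation and falls through to `ENC_LEGACY_ANSI`.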
Good!
The article at https://hsivonen.fi/chardetng/ says Firefox has switched to https://github.com/hsivonen/chardetng (Rust) as of Firefox 73, and https://github.com/PyYoshi/uchardet has had no updates in the past year.
Thanks for reminding me! UCHARDET is probably the best Chinese character detector I have used. This is the reason I recommend it to you.
Because of its GPL licensing, the updated uchardet cannot be used with Notepad2 unless it is built as a DLL.
Mozilla's (u)chardet would generate better results...
Here is just one example where it wouldn't: