Change Encoding-Detector to UCHARDET

Open lenny20 opened this issue 3 years ago • 6 comments

Mozilla's (u)chardet would generate better results; I would like to switch to this encoding detector.

Uchardet is an encoding detector library, which takes a sequence of bytes in an unknown character encoding without any additional information, and attempts to determine the encoding of the text. Returned encoding names are iconv-compatible. Uchardet started as a C language binding of the original C++ implementation of the universal charset detection library by Mozilla. It can now detect more charsets, and more reliably than the original implementation. https://www.freedesktop.org/wiki/Software/uchardet/ https://github.com/PyYoshi/uchardet

I experienced the excellent recognition rate of UCHARDET in TextPro. I submitted this to Notepad3 in March 2019 and it was accepted shortly afterwards: https://github.com/rizonesoft/Notepad3/issues/973 "With its training ability and its detection parameter in "%", UCHARDET is really superior !"

Thank you!

lenny20 avatar May 20 '21 20:05 lenny20

I'm not interested in an encoding detection library:

  1. UTF-8 is getting more and more popular, and is the default encoding used by Notepad2 and many other applications (including Windows Notepad on Win10). We already use very fast UTF-8 validation code (from https://github.com/zwegner/faster-utf8-validator and https://bjoern.hoehrmann.de/utf-8/decoder/dfa/).
  2. UTF-16 and UTF-32 (not supported, rarely used as storage format/encoding) files must begin with a BOM, so no detection is needed.
  3. For other legacy encodings, we currently default to the Windows ANSI code page (GetACP()), which is most likely correct.
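
The three rules above can be sketched roughly as follows. This is only an illustration in Python, not Notepad2's actual code: the function name `guess_encoding` and the `ansi_codepage` parameter are hypothetical, `cp1252` merely stands in for whatever `GetACP()` would return on a given machine, and the standard library's strict UTF-8 decoder stands in for the fast validators linked above. The UTF-32 BOMs are checked only so that a UTF-32 LE file is not mistaken for UTF-16 LE.

```python
import codecs

# BOM signatures. UTF-32 comes first because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(data: bytes, ansi_codepage: str = "cp1252") -> str:
    """Hypothetical helper: pick an encoding without statistical detection.

    ansi_codepage is an assumed stand-in for the system ANSI code page.
    """
    # Rule 2: a BOM identifies the encoding outright.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # Rule 1: text that validates as strict UTF-8 is treated as UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Rule 3: anything else falls back to the ANSI code page.
        return ansi_codepage
```

For example, `guess_encoding("héllo".encode("cp1252"))` falls through to the fallback, because a lone `0xE9` byte followed by an ASCII letter is not valid UTF-8. Legacy-encoded text that happens to be valid UTF-8, or that does not match the system code page, would be misread under this scheme, which is the trade-off being discussed in this thread.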

zufuliu avatar May 21 '21 14:05 zufuliu

Good!

lenny20 avatar May 21 '21 16:05 lenny20

The article at https://hsivonen.fi/chardetng/ says Firefox has used https://github.com/hsivonen/chardetng (Rust) since Firefox 73, and https://github.com/PyYoshi/uchardet has had no updates in the past year.

zufuliu avatar May 24 '21 15:05 zufuliu

Thanks for reminding me! UCHARDET is probably the best Chinese character detector I have used. This is the reason I recommend it to you.

lenny20 avatar May 24 '21 18:05 lenny20

Because of its GPL licensing, the updated uchardet cannot be used with Notepad2 unless it is built as a DLL.

zufuliu avatar May 25 '21 12:05 zufuliu

Mozilla's (u)chardet would generate better results...

Here is just one example where it wouldn't:

test.txt

levicki avatar Dec 21 '21 13:12 levicki