Notepad2e
Invalid file charset detection under Japanese locale
Very similar issue to #269, also tested in XP and Win10. Under Japanese locale, Notepad2 thinks the attached file is Unicode-encoded. Under another locale, it correctly detects it as UTF-8. I found that about 10% of similar text files hit this problem.
I tried to locate a symbol or line that causes the mis-detection, but could only determine that individual characters mostly do not matter and that it happens with some combinations of a character plus a trailing space. It's often enough to change just one symbol (e.g. replace a multi-byte symbol with a single-byte one) to "fix" the detection.
Found this entry in Notepad2 FAQ: http://www.flos-freeware.ch/development-releases/notepad2-FAQs.html#unicode-detection
There's a link to MSDN too. Maybe it is related.
That said, is charset detection performed by the Win32 API, or can it be influenced? I noticed it doesn't detect Shift-JIS correctly (not at all, under either locale), and this should be corrected if it's not too hard.
Please consider the following:
- The IsTextUnicode function is used to determine whether the provided text is Unicode.
- It was found that the function behaves differently depending on the OS locale. It returns 0 under an English locale and 0x402 under a Japanese locale (which is IS_TEXT_UNICODE_DBCS_LEADBYTE | IS_TEXT_UNICODE_STATISTICS).
- The committed change addresses this incorrect detection of Unicode text under the Japanese locale and follows Microsoft's recommendation:
The IS_TEXT_UNICODE_STATISTICS and IS_TEXT_UNICODE_REVERSE_STATISTICS tests use statistical analysis. These tests are not foolproof. The statistical tests assume certain amounts of variation between low and high bytes in a string, and some ASCII strings can slip through. For example, if lpv indicates the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, although failure would be preferable.
- UTF-8 detection is implemented directly in the Notepad2e code.
- Notepad2 ― Encoding Tutorial should help to resolve the issue with Shift-JIS encoding detection.
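
To make the 0x402 result easier to read, here is a minimal Python sketch that decodes an IsTextUnicode result into flag names and shows one possible way to keep statistical-only evidence from overriding a UTF-8 check. The flag values are the ones defined in winnt.h. Everything else is an assumption for illustration: the helper names (`decode_flags`, `has_hard_utf16_evidence`, `is_valid_utf8`, `guess_encoding`), the check order, and the policy itself are hypothetical and are not taken from the Notepad2e commit. The IsTextUnicode result is passed in as a plain integer, since the API itself is Windows-only.

```python
# Flag values as defined in the Windows SDK header winnt.h.
IS_TEXT_UNICODE_FLAGS = {
    0x0001: "IS_TEXT_UNICODE_ASCII16",
    0x0002: "IS_TEXT_UNICODE_STATISTICS",
    0x0004: "IS_TEXT_UNICODE_CONTROLS",
    0x0008: "IS_TEXT_UNICODE_SIGNATURE",
    0x0010: "IS_TEXT_UNICODE_REVERSE_ASCII16",
    0x0020: "IS_TEXT_UNICODE_REVERSE_STATISTICS",
    0x0040: "IS_TEXT_UNICODE_REVERSE_CONTROLS",
    0x0080: "IS_TEXT_UNICODE_REVERSE_SIGNATURE",
    0x0100: "IS_TEXT_UNICODE_ILLEGAL_CHARS",
    0x0200: "IS_TEXT_UNICODE_ODD_LENGTH",
    0x0400: "IS_TEXT_UNICODE_DBCS_LEADBYTE",
    0x1000: "IS_TEXT_UNICODE_NULL_BYTES",
}

# Bits that count as evidence *for* UTF-16 (UNICODE_MASK | REVERSE_MASK).
UNICODE_EVIDENCE_MASK = 0x000F | 0x00F0
# The two tests the documentation describes as "not foolproof".
STATISTICAL_BITS = 0x0002 | 0x0020


def decode_flags(result):
    """Return the names of all flags set in an IsTextUnicode result."""
    return [name for bit, name in sorted(IS_TEXT_UNICODE_FLAGS.items())
            if result & bit]


def has_hard_utf16_evidence(result):
    """Hypothetical policy: ignore evidence that is purely statistical."""
    return bool(result & UNICODE_EVIDENCE_MASK & ~STATISTICAL_BITS)


def is_valid_utf8(data):
    """Strict UTF-8 check using Python's built-in decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def guess_encoding(data, is_text_unicode_result):
    """Hypothetical detection order: hard UTF-16 evidence, then UTF-8."""
    if has_hard_utf16_evidence(is_text_unicode_result):
        return "utf-16"
    if is_valid_utf8(data):
        return "utf-8"
    return "ansi"  # fall back to the locale code page (e.g. Shift-JIS)


print(decode_flags(0x402))
# ['IS_TEXT_UNICODE_STATISTICS', 'IS_TEXT_UNICODE_DBCS_LEADBYTE']
print(guess_encoding("日本語 テキスト".encode("utf-8"), 0x402))
# utf-8
```

With 0x402, the only bit that points toward Unicode is the statistical one, and IS_TEXT_UNICODE_DBCS_LEADBYTE actually belongs to the "not Unicode" group of flags, so under this policy the file falls through to the UTF-8 check. Note that a byte sequence that is not valid UTF-8 only ends up as "ansi" here; telling Shift-JIS apart from other code pages would need a separate step.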