
Copying to clipboard in ANSI charset results in bad charset

Open ProgerXP opened this issue 5 years ago • 14 comments

  1. Open a text file in ANSI charset (1251) with Cyrillic characters like ффф
  2. Create a text file with UTF-8 charset
  3. Copy contents of the first file to the second file

As a result, the second document contains not ффф but other symbols, as if the first file had been read in a different charset.

This bug doesn't occur in Notepad2.

ProgerXP avatar Jun 02 '20 10:06 ProgerXP

Fixed. Strange changes came from Scintilla 3.11.2 and caused the specified behavior. Rollback.

cshnik avatar Jul 14 '20 19:07 cshnik

Fixed. Strange changes came from Scintilla 3.11.2 and caused the specified behavior. Rollback.

Maybe we should report this to them? This is definitely a bug.

ProgerXP avatar Jul 16 '20 20:07 ProgerXP

It's not a bug; this was actually changed in Scintilla 3.6.7: https://www.scintilla.org/ScintillaHistory.html

SC_CHARSET_DEFAULT now means code page 1252 on Windows unless a code page is set. This prevents unexpected behaviour and crashes on East Asian systems where default locales are commonly DBCS. Projects which want to default to DBCS code pages in East Asian locales should set the code page and character set explicitly.

See https://sourceforge.net/p/scintilla/bugs/2093/

Drag & Drop ANSI (CF_TEXT) is also dropped in Scintilla 4.3.0/3.20.0, see https://sourceforge.net/p/scintilla/bugs/2151/.

zufuliu avatar Aug 09 '20 09:08 zufuliu

@zufuliu Thanks for bringing this to our attention.

case SC_CHARSET_DEFAULT: return documentCodePage ? documentCodePage : 1252;

@cshnik Can we avoid changing Scintilla's code here? Notepad2 documents always have some charset specified, so I don't see why it's using 1252 if the statusbar says ANSI (1251). If I switch to UTF-8 then it works. It also works if I switch to OEM (866). Does it consider the ANSI codepage "unset"?

ProgerXP avatar Aug 18 '20 20:08 ProgerXP

But most apps have changed that line, e.g. I use case SC_CHARSET_DEFAULT: return documentCodePage;. https://github.com/zufuliu/notepad2/blob/master/scintilla/win32/ScintillaWin.cxx#L1460
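To make the difference between the two variants of that line concrete, here is a minimal portable sketch. The function names and kCpAcpValue are hypothetical stand-ins introduced for illustration; only the two return expressions come from the thread. The sketch assumes that 0 is the numeric value of CP_ACP in the Windows headers, so a result of 0 would make Windows conversion calls use the system ANSI code page.

```cpp
// Assumed numeric value of CP_ACP from the Windows headers (stand-in constant).
constexpr int kCpAcpValue = 0;

// Upstream Scintilla line: an unset (0) document code page falls back to 1252.
constexpr int UpstreamDefaultCodePage(int documentCodePage) {
    return documentCodePage ? documentCodePage : 1252;
}

// Patched line quoted above: the document code page is returned unchanged.
constexpr int PatchedDefaultCodePage(int documentCodePage) {
    return documentCodePage;
}

// With an explicit code page such as 65001 (UTF-8) both variants agree.
static_assert(UpstreamDefaultCodePage(65001) == PatchedDefaultCodePage(65001),
              "explicit code pages are unaffected");

// With an unset code page they differ: 1252 versus 0 (the CP_ACP value).
static_assert(UpstreamDefaultCodePage(0) == 1252, "upstream forces 1252");
static_assert(PatchedDefaultCodePage(0) == kCpAcpValue, "patched yields CP_ACP");
```

The two lines only disagree when the document code page is 0, which would match the observation that UTF-8 buffers behave correctly while ANSI buffers do not.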

zufuliu avatar Aug 19 '20 00:08 zufuliu

Can we avoid changing Scintilla's code here? Notepad2 documents always have some charset specified, so I don't see why it's using 1252 if the statusbar says ANSI (1251). If I switch to UTF-8 then it works. It also works if I switch to OEM (866). Does it consider the ANSI codepage "unset"?

Currently I'm unable to reproduce the original problem. By the way, there is no ANSI (1251) encoding; ANSI is named ANSI (1252). Please provide detailed steps to reproduce with the correct encoding names/modes, default encoding set, etc.

Test build with a rollback for previous fix: Notepad2e-fix-rollback.zip

cshnik avatar Aug 19 '20 18:08 cshnik

But most apps have changed that line

I want to avoid modifying Scintilla sources as much as possible. We already have many patches and it's seriously complicating Scintilla updates.

Currently I'm unable to reproduce the original problem.

My set-up:

  • Regional options (in Control Panel) has "language for non-Unicode programs" set to Russian
  • Notepad2 shows first entry in "Encoding" list = "ANSI (1251)"
  • open new Notepad 2e window, set Encoding to "ANSI (1251)", enter some text like ффф, copy it (Alt+C), then open another window but set Encoding to UTF-8 (Shift+F8) and Paste (Ctrl+V)

As a result, you will get ôôô. Same thing if you Paste to another Unicode program, e.g. to Firefox. If you switch the buffer from ANSI to UTF-8 then pasting works correctly.
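The ôôô result is what you would expect if the byte that code page 1251 uses for ф were decoded as code page 1252. A minimal portable sketch of that arithmetic follows; the two decode helpers are hypothetical and cover only the upper byte range of each code page, which is enough for this character.

```cpp
#include <cstdint>

// Hypothetical helper: decode one Windows-1251 byte in the range 0xC0..0xFF,
// where the code page maps contiguously onto U+0410..U+044F.
constexpr char32_t DecodeCp1251Upper(std::uint8_t b) {
    return static_cast<char32_t>(0x0410 + (b - 0xC0));
}

// Hypothetical helper: decode one Windows-1252 byte in the range 0xA0..0xFF,
// where the code page matches Latin-1, so the byte value is the code point.
constexpr char32_t DecodeCp1252Upper(std::uint8_t b) {
    return static_cast<char32_t>(b);
}

// In an ANSI (1251) buffer the Cyrillic letter ф is stored as the byte 0xF4.
constexpr std::uint8_t kByteForEf = 0xF4;

// Read as 1251 the byte is U+0444 (ф); read as 1252 it is U+00F4 (ô).
static_assert(DecodeCp1251Upper(kByteForEf) == 0x0444, "1251 reading is U+0444");
static_assert(DecodeCp1252Upper(kByteForEf) == 0x00F4, "1252 reading is U+00F4");
```

So the clipboard conversion to Unicode appears to run with 1252 rather than 1251, which fits the SC_CHARSET_DEFAULT fallback discussed earlier in the thread.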

Note: I believe the problem occurs when 1) ANSI is used 2) user's codepage is set to Russian. If you use "Cyrillic (Windows-1251)" which is technically the same as "ANSI (1251)" - the problem disappears. I believe this is because documentCodePage is 0 when ANSI is used (but this is just a supposition).

By the way, there is no ANSI (1251) encoding, ANSI is named as ANSI (1252).

ANSI is a collective term that means the same as Windows codepage (or just cpXXX, e.g. cp1251). It stands for SBCS (1-byte character set) and depends on the codepage of the current Windows user, which can be changed via Control Panel > Region > Administrative. Usually it matches the user's locale (language), but not necessarily.

Test build with a rollback for previous fix:

I see no difference, this version is also triggering this bug.

ProgerXP avatar Aug 21 '20 16:08 ProgerXP

Default charset was not set properly. Fixed.

cshnik avatar Aug 24 '20 16:08 cshnik

Another strange issue. In my described setup, if you set buffer to ANSI (1251) and try entering Cyrillic symbols - you will get question marks (this doesn't happen in Notepad2 that works properly). But setting Cyrillic (Windows-1251) works correctly. And yet these two charsets must be identical, ANSI is Windows-1251 - why this difference?

Maybe we should avoid passing 0 as the charset identifier to Scintilla at all? Even for ANSI we can detect the charset (as the Encoding dialog shows) and so we can pass its identifier explicitly, avoiding all this.
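One way this suggestion could look is sketched below. It is only an illustration, not Notepad2e's actual code: CharsetForAnsiCodePage is a hypothetical helper, the constants are stand-ins that I assume mirror the SC_CHARSET_* values in Scintilla.h, and only a few code pages are covered. On Windows the input would presumably come from GetACP().

```cpp
// Stand-in constants, assumed to mirror the SC_CHARSET_* values in Scintilla.h.
constexpr int kCharsetDefault = 1;       // SC_CHARSET_DEFAULT
constexpr int kCharsetGreek = 161;       // SC_CHARSET_GREEK
constexpr int kCharsetTurkish = 162;     // SC_CHARSET_TURKISH
constexpr int kCharsetRussian = 204;     // SC_CHARSET_RUSSIAN
constexpr int kCharsetEastEurope = 238;  // SC_CHARSET_EASTEUROPE

// Hypothetical helper: choose an explicit character set for the system ANSI
// code page instead of passing 0 to Scintilla.
constexpr int CharsetForAnsiCodePage(int ansiCodePage) {
    switch (ansiCodePage) {
    case 1250: return kCharsetEastEurope;
    case 1251: return kCharsetRussian;
    case 1253: return kCharsetGreek;
    case 1254: return kCharsetTurkish;
    default:   return kCharsetDefault;   // code pages this sketch does not cover
    }
}

// On the Russian setup described above the ANSI code page is 1251.
static_assert(CharsetForAnsiCodePage(1251) == kCharsetRussian,
              "1251 selects the Russian character set");
```

With a mapping like this, an ANSI (1251) buffer would be handed to Scintilla the same way as an explicitly chosen Cyrillic charset, so the 1252 fallback would not come into play.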

ProgerXP avatar Jan 13 '21 15:01 ProgerXP

Another strange issue. In my described setup, if you set buffer to ANSI (1251) and try entering Cyrillic symbols - you will get question marks (this doesn't happen in Notepad2 that works properly). But setting Cyrillic (Windows-1251) works correctly. And yet these two charsets must be identical, ANSI is Windows-1251 - why this difference?

Changing the encoding from ANSI to Windows-1251 causes the Scintilla editor to switch its internal mode to Unicode (see the "Switching the file encoding from " prompt), which means text input and drawing are processed differently at runtime and the internal code page is set to SC_CP_UTF8.

It looks like this is another problem in Scintilla. For some reason the ANSI code page is always treated as 1252. The Scintilla docs state the following:

SC_CHARSET_ANSI and SC_CHARSET_DEFAULT specify European Windows code page 1252 unless the code page is set.

While the following code is used in Notepad2e and in the most recent Scintilla 3.x release, Scintilla 3.21.1:

 	case SC_CHARSET_ANSI: return 1252;
 	case SC_CHARSET_DEFAULT: return documentCodePage ? documentCodePage : 1252;

Changing the result for SC_CHARSET_ANSI to CP_ACP ("the system default Windows ANSI code page" according to the WideCharToMultiByte documentation) has fixed this issue.

 	case SC_CHARSET_ANSI: return CP_ACP;
 	case SC_CHARSET_DEFAULT: return documentCodePage ? documentCodePage : 1252;
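
The effect of the changed line can be checked with a small portable sketch. Everything here is a stand-in for illustration: the k-prefixed constants are assumed to match SC_CHARSET_ANSI, SC_CHARSET_DEFAULT and CP_ACP, the other character sets are omitted, and ResolveCodePage is a hypothetical model of Windows substituting the system ANSI code page when it is given CP_ACP.

```cpp
// Stand-in constants; real builds take these from Scintilla.h and Windows headers.
constexpr int kScCharsetAnsi = 0;     // SC_CHARSET_ANSI (assumed value)
constexpr int kScCharsetDefault = 1;  // SC_CHARSET_DEFAULT (assumed value)
constexpr int kCpAcp = 0;             // CP_ACP (assumed value)

// Sketch of the two cases shown above. ansiUsesAcp selects between the
// original line (false) and the changed line (true).
constexpr int CodePageForCharset(int charset, int documentCodePage, bool ansiUsesAcp) {
    switch (charset) {
    case kScCharsetAnsi:
        return ansiUsesAcp ? kCpAcp : 1252;
    case kScCharsetDefault:
        return documentCodePage ? documentCodePage : 1252;
    default:
        return documentCodePage;  // other character sets are omitted from this sketch
    }
}

// Hypothetical model of a conversion call: CP_ACP becomes the system ANSI page.
constexpr int ResolveCodePage(int codePage, int systemAnsiCodePage) {
    return codePage == kCpAcp ? systemAnsiCodePage : codePage;
}

// On a system whose ANSI code page is 1251:
static_assert(ResolveCodePage(CodePageForCharset(kScCharsetAnsi, 0, false), 1251) == 1252,
              "original line: ANSI text is converted as 1252");
static_assert(ResolveCodePage(CodePageForCharset(kScCharsetAnsi, 0, true), 1251) == 1251,
              "changed line: ANSI text is converted as 1251");
```

Under these assumptions the original line would convert ANSI text as 1252 regardless of the system setting, while the changed line follows the system ANSI code page.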

Test build: Notepad2e-locale-test-fix-012821.zip

cshnik avatar Jan 27 '21 20:01 cshnik

While the following code is used in Notepad2e and in the most recent Scintilla 3.x release, Scintilla 3.21.1:

Shouldn't we report this discrepancy to them?

ProgerXP avatar Jan 27 '21 20:01 ProgerXP

Shouldn't we report this discrepancy to them?

Probably yes. It looks like this is a rare case, since applications usually use Scintilla with the charset set to SC_CHARSET_DEFAULT, while Notepad2e specifies the charset explicitly to set up the required mode.

@zufuliu, do you happen to know about this discrepancy in documentation and behavior for SC_CHARSET_ANSI?

cshnik avatar Jan 31 '21 19:01 cshnik

Another strange issue. In my described setup, if you set buffer to ANSI (1251) and try entering Cyrillic symbols - you will get question marks (this doesn't happen in Notepad2 that works properly).

BTW The same issue with ANSI in Notepad++ with a fix for case SC_CHARSET_DEFAULT: https://www.gitmemory.com/issue/notepad-plus-plus/notepad-plus-plus/5671/496173345

cshnik avatar Jan 31 '21 21:01 cshnik

@cshnik, I don't use SC_CHARSET_ANSI, but I patched Scintilla so that the SC_CHARSET_DEFAULT case reads case SC_CHARSET_DEFAULT: return documentCodePage;, see https://sourceforge.net/p/scintilla/bugs/2093/#3ee4/1677 and https://github.com/zufuliu/notepad2/blob/master/scintilla/win32/ScintillaWin.cxx#L1471

zufuliu avatar Feb 01 '21 13:02 zufuliu