NotepadNext [Bug Report] Two issues with Chinese documents

Environment

OS: Windows 10 21H2 (19044.2130)
NotepadNext version: latest (and also tested Qt6 version)
Input method: Sogou Input Method (搜狗输入法) 11.0 official release (11.0.0.4895)

Issue 1: Chinese characters seem not to work well under ANSI

When you open an ANSI document

When you create a new `*.txt` file

If you open NotepadNext directly, it defaults to UTF-8 encoding, which works well.

However, if you create a *.txt file via the right-click menu and then open it with NotepadNext, the file is treated as ANSI-encoded. Any Chinese characters you then type into the file are replaced with ? (like #200).

Issue 2: strange behavior when saving a UTF-8 file

Steps:

1. Open a UTF-8 encoded file.
2. Modify anything, even just adding a single space.
3. Press Ctrl+S to save the file.

Before save:

After save:

This file can be downloaded here.

diff --git a/_posts/2021-3-21-use_valine.md b/_posts/2021-3-21-use_valine.md
index 30b447a..749bc17 100644
--- a/_posts/2021-3-21-use_valine.md
+++ b/_posts/2021-3-21-use_valine.md
@@ -12,18 +12,18 @@ tags: [jekyll, 教程, 网站, valine]     # TAG names should always be lowercas

 根据 [Valine 官方教程](https://valine.js.org/quickstart.html){:target="_blank"}注册 LeanCloud 以获取 APP ID 和 APP Key
。注：注册[国内版 LeanCloud](https://leancloud.cn/){:target="_blank"} 需要绑定已备案的域名，而注册[国际版 LeanCloud](https://leancloud.app/){:target="_blank"} 则不需要。

-如果是 fork 主题搭建博客，修改对应文件即可。如果是使用 theme 或者 remote_theme，则需要下载对应的文件放在相应目录后再修
改。
+如果是 fork 主题搭建博客，修改对应文件即可。如果是使用 theme 或者 remote_tem，则需要下载对应的文件放在縿<BD>目录后再修
改。

-## 配置 `_config.yml`
+## 配置 `_confgyml`

-找到 disqus 数据段并删除：
+找到 disqus 数据段并删除<BF>�

 ```yml
 disqus:
   comments: false
   shortname: ''
 ```
-{: file="_config.yml" }
+{: file="cofig.yml" }

 增加 valine 的数据段：

Oct 14 '22 02:10 NichtsHsu

So I'll definitely admit I have no clue when it comes to any type of alternative input methods...so I'm kind of lost when it comes to those.

But, if you create a *.txt file via right-click menu

Since it is an empty file, it technically has no 'encoding', so it defaults to ANSI. For a new document created within the application, it defaults to UTF-8.
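To illustrate why an empty file gives a detector nothing to work with, here is a minimal standard-C++ sketch. It is not NotepadNext's actual detection logic; `EncodingGuess` and `guessEncoding` are hypothetical names, and the check is structural only (it does not reject every overlong or surrogate sequence).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for illustration; not part of NotepadNext.
enum class EncodingGuess { Ambiguous, Utf8, NotUtf8 };

// Structural UTF-8 check: validates lead bytes and continuation bytes.
// Empty or pure-ASCII input is reported as Ambiguous, because those bytes
// are equally valid in UTF-8 and in the ASCII-compatible ANSI code pages.
inline EncodingGuess guessEncoding(const std::string &bytes) {
    bool sawMultibyte = false;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80) extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return EncodingGuess::NotUtf8;  // invalid lead byte

        // The whole sequence must fit inside the buffer.
        if (extra > bytes.size() - i - 1) return EncodingGuess::NotUtf8;
        assert(i + extra < bytes.size());

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return EncodingGuess::NotUtf8;
        }
        if (extra > 0) sawMultibyte = true;
        i += extra + 1;
    }
    return sawMultibyte ? EncodingGuess::Utf8 : EncodingGuess::Ambiguous;
}
```

With this sketch an empty buffer comes back as `Ambiguous`, so the caller still has to pick a default, which is where a user-configurable "encoding for new or empty files" setting would come in.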

I believe this probably isn't an issue with Notepad++ due to this setting:

Notepad Next would need something similar.

strange behavior when saving a UTF-8 file

That definitely is strange. Thanks for providing an example file...I haven't had a chance to try it yet but when I get some free time I'll see what I can find out.

Oct 14 '22 03:10 dail8859

Couple other things to mention, there is the Debug Log which shows a little bit of information when trying to determine file encoding.

There is also a very crude hex viewer built in, but I don't think it shows real-time modifications yet...so I'm not sure how helpful that might be currently.

Oct 14 '22 03:10 dail8859

I would like to add that Chinese characters copied and pasted into a new ANSI document are also displayed as ?.

Oct 14 '22 04:10 NichtsHsu

I would like to add that Chinese characters copied and pasted into a new ANSI document are also displayed as ?.

I believe this is expected. ANSI is a single byte encoding, so it does not know how to render multi-byte characters. Notepad++ has this same behavior for ANSI documents.

Oct 14 '22 11:10 dail8859

ANSI is a single byte encoding

From what I understand, ANSI is not a fixed encoding; it's specific to the Windows locale. For the Simplified Chinese locale, ANSI is GBK; for the Japanese locale, ANSI is Shift-JIS; for the Traditional Chinese locale, ANSI is Big5.

Oct 14 '22 11:10 NichtsHsu

From what I understand, ANSI is not a fixed encoding

Good to know. I'll need to look deeper into that and see how Notepad++ handles it. To be honest I am not very familiar with encoding, code pages, etc.

Oct 14 '22 11:10 dail8859

Same problem for me. :>

Oct 19 '22 01:10 armink

For issue 2, I tried to compare the file before and after saving using hex, and I found that some bytes were missing.

> e6 88 96 e8 80 85            // 或者
< e6 96 80 85
> e7 9a 84 e6 96 87 e4 bb b6   // 的文件
< e7 9a e6 87 bb b6
> e9 85 8d e7 bd ae            // 配置
< e9 85 e7 ae
> e5 b9 b6 e5 88 a0 e9 99 a4   // 并删除
< e5 b9 e5 a0 99 a4
> 63 6f 6e 66 69 67            // config
< 63 6f 66 69 67               // cofig

Unfortunately I haven't found any pattern in the missing bytes.

Then I tried the following code:

diff --git a/src/NotepadNext/ScintillaNext.cpp b/src/NotepadNext/ScintillaNext.cpp
index 8133b8e..780e9dc 100644
--- a/src/NotepadNext/ScintillaNext.cpp
+++ b/src/NotepadNext/ScintillaNext.cpp
@@ -173,7 +173,17 @@ bool ScintillaNext::save()
 
     emit aboutToSave();
 
-    bool writeSuccessful = writeToDisk(QByteArray::fromRawData((char*)characterPointer(), textLength()), fileInfo.filePath());
+    auto raw = (char *)characterPointer();
+
+    QFile file("./test_rawdata.txt");
+    file.open(QFile::WriteOnly);
+    QDataStream dts(&file);
+    dts.writeRawData(raw, textLength());
+    file.flush();
+    file.close();
+
+    QByteArray data = QByteArray::fromRawData(raw, textLength());
+    bool writeSuccessful = writeToDisk(data, fileInfo.filePath());
 
     if (writeSuccessful) {
         updateTimestamp();

As a result, the file saved directly from the raw data pointer is also wrong. So it might be an issue in Scintilla, or at least something went wrong with Scintilla's configuration or an API call.

Oct 21 '22 06:10 NichtsHsu

So it might be an issue from Scintilla, or at least something went wrong with Scintilla's configuration or API call.

Agreed. There are ways to set the Code Page and I honestly don't know the best way to handle that.

Might be worth a shot to see if modifying the code page in real time would have any effect (no clue how it affects the currently loaded document). You can do this through the "Lua Console", e.g.:

editor.CodePage = SC_CP_UTF8
-- or something like
editor.CodePage = 936

Oct 21 '22 11:10 dail8859

Hi @dail8859 ,

I must admit that my previous conclusion was wrong. If you delete emit aboutToSave();, everything works fine.

After a more detailed investigation, I found that the problem is inside the function trimTrailingWhitespace. We can do a simple test:

diff --git a/src/NotepadNext/Finder.cpp b/src/NotepadNext/Finder.cpp
index a8984f8..9b63527 100644
--- a/src/NotepadNext/Finder.cpp
+++ b/src/NotepadNext/Finder.cpp
@@ -151,6 +151,7 @@ int Finder::replaceAll(const QString &replaceText)
         total++;
         editor->setTargetRange(start, end);
 
+        qDebug() << "Target byte is " << editor->targetText().toHex();
         if (isRegex)
             return start + editor->replaceTargetRE(replaceData.length(), replaceData.constData());
         else

Application output:

[    10.174] I: Scintilla::Internal::RegexSearchBase* Scintilla::Internal::CreateRegexSearch(CharClassify*)
[    10.174] D: Target byte is  "ef"
[    10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[    10.174] D: Target byte is  "bd"
[    10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[    10.174] D: Target byte is  "bf"
[    10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[    10.174] D: Target byte is  "e6"
[    10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[    10.174] D: Target byte is  "20"
[    10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[    10.175] D: Target byte is  "20"
[    10.175] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[    10.175] D: Target byte is  "27"
[    10.175] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)

So, it is clear that the function trimTrailingWhitespace replaces these bytes with null causing them to be missing.

Oct 24 '22 07:10 NichtsHsu

I'm not sure if this is correct, but it looks like SCI_FINDTEXT and SCI_REPLACETARGET are based on different units.

Concretely, SCI_FINDTEXT appears to use "characters" as its unit, that is to say any Unicode character such as "啊" is treated as one unit. But SCI_REPLACETARGET appears to use "bytes" as its unit, which means "啊" (three bytes in UTF-8) is three units.

Update:

Now I'm fairly sure this is the reason. SCI_FINDTEXT is based on QRegularExpression, and QRegularExpression does indeed work in units of characters.
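To make the unit mismatch concrete, here is a minimal standard-C++ sketch (no Qt or Scintilla involved; the helper names `countCodePoints` and `eraseBytes` are made up for illustration). It assumes valid UTF-8 input and shows that a position counted in characters lands in the middle of a multi-byte sequence when it is applied as a byte offset.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (bit pattern 10xxxxxx). Assumes the input is valid UTF-8.
inline std::size_t countCodePoints(const std::string &utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Removes `len` bytes starting at byte offset `pos`, the way a byte-based
// replace-with-nothing would. Takes the string by value and returns the result.
inline std::string eraseBytes(std::string text, std::size_t pos, std::size_t len) {
    assert(pos <= text.size());
    text.erase(pos, len);
    return text;
}
```

For the six bytes of "或者" followed by a space (`e6 88 96 e8 80 85 20`), `countCodePoints` gives 3 while the string holds 7 bytes. The trailing space sits at character index 2 but byte index 6, so `eraseBytes(text, 2, 1)` removes the byte `96` from the middle of "或" instead of the space, which is the same kind of damage seen in the hex comparison.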

Oct 24 '22 08:10 NichtsHsu

Unfortunately, I don't see QRegularExpression and std::regex provide any interface based on byte position, nor do I see Scintilla provide any interface based on character position.

Without changing the framework, we can do something like:

diff --git a/src/NotepadNext/QRegexSearch.cpp b/src/NotepadNext/QRegexSearch.cpp
index 7c0ffd2..dec6287 100644
--- a/src/NotepadNext/QRegexSearch.cpp
+++ b/src/NotepadNext/QRegexSearch.cpp
@@ -68,17 +68,25 @@ Sci::Position QRegexSearch::FindText(Document *doc, Sci::Position minPos, Sci::P
 
     // Only need the first maxPos characters
     QByteArray view = QByteArray::fromRawData(doc->BufferPointer(), maxPos);
+    QString stringView = QString::fromUtf8(view);
+
+    // We should also consider other multibyte encodings, so `fromUtf8` is not the final solution
+    minPos = QString::fromUtf8(view.mid(0, minPos)).length();
 
     // Start at minPos, this keeps the position match inline with Scintilla since it thinks it starts at the beginning
-    QRegularExpressionMatch m = re.match(view, minPos, QRegularExpression::NormalMatch, QRegularExpression::NoMatchOption);
+    QRegularExpressionMatch m = re.match(stringView, minPos, QRegularExpression::NormalMatch, QRegularExpression::NoMatchOption);
 
     if (!m.hasMatch())
         return -1; // No match
 
     match = m;
     *length = match.capturedLength(0);
+    // We should also consider other multibyte encodings, so `toUtf8` is not the final solution
+    qsizetype start = stringView.mid(0, match.capturedStart(0)).toUtf8().length();
+    *length = match.captured(0).toUtf8().length();
 
-    return match.capturedStart(0);
+    return start;
 }

Using the fact that QString::length returns the count of characters (strictly, UTF-16 code units) and QByteArray::length returns the count of bytes, we can glue QRegularExpression and Scintilla together. However, I cannot evaluate the time and space cost of this patch.
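As a rough idea of what the offset mapping could look like without building temporary strings, here is a standard-C++ sketch. It is only an illustration under stated assumptions: the buffer is valid UTF-8, offsets fall on character boundaries, and the function names are hypothetical rather than anything from NotepadNext, Qt, or Scintilla. It maps between UTF-8 byte offsets (Scintilla's unit) and UTF-16 code-unit offsets (the unit QString positions use) with a linear scan.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-16 code units contributed by the UTF-8 sequence starting at byte b:
// 0 for a continuation byte, 2 for a 4-byte lead (surrogate pair), else 1.
inline std::size_t utf16UnitsForByte(unsigned char b) {
    if ((b & 0xC0) == 0x80) return 0;
    return (b >= 0xF0) ? 2 : 1;
}

// Byte offset -> UTF-16 offset (e.g. before handing minPos to the regex).
inline std::size_t byteToUtf16Offset(const std::string &utf8, std::size_t byteOffset) {
    assert(byteOffset <= utf8.size());
    std::size_t units = 0;
    for (std::size_t i = 0; i < byteOffset; ++i) {
        units += utf16UnitsForByte(static_cast<unsigned char>(utf8[i]));
    }
    return units;
}

// UTF-16 offset -> byte offset (e.g. for a match start reported by the regex).
inline std::size_t utf16ToByteOffset(const std::string &utf8, std::size_t utf16Offset) {
    std::size_t units = 0;
    std::size_t i = 0;
    while (i < utf8.size() && units < utf16Offset) {
        units += utf16UnitsForByte(static_cast<unsigned char>(utf8[i]));
        ++i;
        // Skip the continuation bytes that belong to this character.
        while (i < utf8.size() &&
               (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80) {
            ++i;
        }
    }
    return i;
}
```

A match length in bytes can then be derived as `utf16ToByteOffset(text, start + length) - utf16ToByteOffset(text, start)`. Each call is O(n) in the prefix length, so repeated searches over a large document would probably want to cache or incrementally update the last converted position; whether that matters in practice would need measuring.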

Oct 24 '22 10:10 NichtsHsu

So, it is clear that the function trimTrailingWhitespace replaces these bytes with null causing them to be missing.

@NichtsHsu Awesome!! Thanks for taking the time to troubleshoot this and track down the root cause.

I personally haven't given much attention to multibyte characters, but I foresee this kind of thing happening frequently now that more non-English users are using the application.

Unfortunately, I don't see QRegularExpression and std::regex provide any interface based on byte position, nor do I see Scintilla provide any interface based on character position.

I'm not sure either, I'd have to do some investigating on it as well.

However, I cannot evaluate the time cost and space cost of this patch.

I'm actually glad to see this comment, because it means you are thinking of the worst-case scenarios rather than just "this works so leave it." The code is incredibly useful as it starts to point to possible solutions :)

So a bit of a side note for anyone interested: if for some reason QRegularExpression poses too much of a problem or burden, then other solutions might have to be evaluated. I know Notepad++ uses Boost, but I really don't want to rely on that at all, especially when Qt has its own built-in regular expression support.

Oct 24 '22 11:10 dail8859

@NichtsHsu

I think I finally got a solution.

https://github.com/dail8859/NotepadNext/commit/388439eab918af8477510141498544595b1ed4d7

Regex searching in general is working pretty smoothly as far as I can tell. Not sure how efficient it is (still might be doing an implicit conversion to QString?) but this is a step in the right direction.

Nov 18 '22 04:11 dail8859

Thanks for your wonderful work, it works fine in my documents.

Nov 18 '22 05:11 NichtsHsu
