NotepadNext
[Bug Report] Two issues with Chinese documents
Environment
- OS: Windows 10 21H2 (19044.2130)
- NotepadNext version: latest (and also tested Qt6 version)
- Input method: Sogou Pinyin (搜狗输入法) 11.0 official release (11.0.0.4895)
Issue 1: Chinese characters do not work well under ANSI
When you open an ANSI document
When you create a new *.txt file
If you open NotepadNext directly, it defaults to UTF-8 encoding, which is good and works well.
But if you create a *.txt file via the right-click menu and then open it with NotepadNext, the file is treated as ANSI. When you then type Chinese characters into this file, all of them are replaced with ? (like #200).
Issue 2: strange behavior when saving a UTF-8 file
Steps:
- Open a UTF-8 encoding file.
- Modify anything, even just adding a single space.
- Press Ctrl+S to save the file.
Before save:
After save:
This file can be downloaded here.
diff --git a/_posts/2021-3-21-use_valine.md b/_posts/2021-3-21-use_valine.md
index 30b447a..749bc17 100644
--- a/_posts/2021-3-21-use_valine.md
+++ b/_posts/2021-3-21-use_valine.md
@@ -12,18 +12,18 @@ tags: [jekyll, 教程, 网站, valine] # TAG names should always be lowercas
根据 [Valine 官方教程](https://valine.js.org/quickstart.html){:target="_blank"}注册 LeanCloud 以获取 APP ID 和 APP Key
。注:注册[国内版 LeanCloud](https://leancloud.cn/){:target="_blank"} 需要绑定已备案的域名,而注册[国际版 LeanCloud](https://leancloud.app/){:target="_blank"} 则不需要。
-如果是 fork 主题搭建博客,修改对应文件即可。如果是使用 theme 或者 remote_theme,则需要下载对应的文件放在相应目录后再修
改。
+如果是 fork 主题搭建博客,修改对应文件即可。如果是使用 theme 或者 remote_tem,则需要下载对应的文件放在縿<BD>目录后再修
改。
-## 配置 `_config.yml`
+## 配置 `_confgyml`
-找到 disqus 数据段并删除:
+找到 disqus 数据段并删除<BF>�
```yml
disqus:
comments: false
shortname: ''
```
-{: file="_config.yml" }
+{: file="cofig.yml" }
增加 valine 的数据段:
So I'll definitely admit I have no clue when it comes to any type of alternative input methods...so I'm kind of lost when it comes to those.
But, if you create a *.txt file via right-click menu
Since it is an empty file it technically has no 'encoding' so it defaults to ansi. For a new document within the application, it defaults to utf-8.
I believe this probably isn't an issue with Notepad++ due to this setting:
Notepad Next would need something similar.
strange behavior when saving a UTF-8 file
That definitely is strange. Thanks for providing an example file...I haven't had a chance to try it yet but when I get some free time I'll see what I can find out.
Couple other things to mention, there is the Debug Log which shows a little bit of information when trying to determine file encoding.
There is also a very crude hex viewer built in, but I don't think it shows real-time modifications yet...so I'm not sure how helpful that might be currently.
I would like to add that Chinese characters copied and pasted into a new ANSI document are also displayed as ?.
I would like to add that Chinese characters copied and pasted into a new ANSI document are also displayed as ?.
I believe this is expected. ANSI is a single byte encoding, so it does not know how to render multi-byte characters. Notepad++ has this same behavior for ANSI documents.
ANSI is a single byte encoding
From what I understand, ANSI is not a fixed encoding; it's specific to the Windows locale. For the Simplified Chinese locale, ANSI is GBK; for the Japanese locale, ANSI is Shift-JIS; for the Traditional Chinese locale, ANSI is Big5.
From what I understand, ANSI is not a fixed encoding
Good to know. I'll need to look deeper into that and see how Notepad++ handles it. To be honest I am not very familiar with encoding, code pages, etc.
Same problem for me. :>
For issue 2, I tried to compare the file before and after saving using hex, and I found that some bytes were missing.
> e6 88 96 e8 80 85 // 或者
< e6 96 80 85
> e7 9a 84 e6 96 87 e4 bb b6 // 的文件
< e7 9a e6 87 bb b6
> e9 85 8d e7 bd ae // 配置
< e9 85 e7 ae
> e5 b9 b6 e5 88 a0 e9 99 a4 // 并删除
< e5 b9 e5 a0 99 a4
> 63 6f 6e 66 69 67 // config
< 63 6f 66 69 67 // cofig
Unfortunately I haven't found any pattern in the missing bytes.
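One way to confirm that the saved file is damaged rather than merely re-encoded is to check whether each byte run is still well-formed UTF-8. The following is a minimal sketch in standard C++; the name `isValidUtf8` is hypothetical and is not part of NotepadNext or Scintilla.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects truncated sequences,
// stray continuation bytes, overlong forms, surrogates, and values above
// U+10FFFF. Hypothetical helper for illustration only.
bool isValidUtf8(const std::string &s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b0 = static_cast<std::uint8_t>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 < 0x80) { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;                      // stray continuation or invalid lead byte

        if (i + len > s.size()) return false;   // sequence is cut off

        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<std::uint8_t>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // expected 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }

        if (len == 2 && cp < 0x80) return false;       // overlong
        if (len == 3 && cp < 0x800) return false;      // overlong
        if (len == 4 && cp < 0x10000) return false;    // overlong
        if (cp > 0x10FFFF) return false;               // out of Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate
        i += len;
    }
    return true;
}
```

Applied to the dump above, the original bytes `e6 88 96 e8 80 85` pass, while the saved bytes `e6 96 80 85` fail because `85` is a continuation byte with no lead byte. A check like this cannot detect lost ASCII bytes such as the missing `n` in `config`, since the remaining bytes are still valid.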
Then I tried the following code:
diff --git a/src/NotepadNext/ScintillaNext.cpp b/src/NotepadNext/ScintillaNext.cpp
index 8133b8e..780e9dc 100644
--- a/src/NotepadNext/ScintillaNext.cpp
+++ b/src/NotepadNext/ScintillaNext.cpp
@@ -173,7 +173,17 @@ bool ScintillaNext::save()
emit aboutToSave();
- bool writeSuccessful = writeToDisk(QByteArray::fromRawData((char*)characterPointer(), textLength()), fileInfo.filePath());
+ auto raw = (char *)characterPointer();
+
+ QFile file("./test_rawdata.txt");
+ file.open(QFile::WriteOnly);
+ QDataStream dts(&file);
+ dts.writeRawData(raw, textLength());
+ file.flush();
+ file.close();
+
+ QByteArray data = QByteArray::fromRawData(raw, textLength());
+ bool writeSuccessful = writeToDisk(data, fileInfo.filePath());
if (writeSuccessful) {
updateTimestamp();
As a result, the file saved directly from the raw data pointer is also wrong. So it might be an issue from Scintilla, or at least something went wrong with Scintilla's configuration or API call.
So it might be an issue from Scintilla, or at least something went wrong with Scintilla's configuration or API call.
Agreed. There are ways to set the Code Page and I honestly don't know the best way to handle that.
Might be worth a shot to see if modifying the code page in real time would have any effect (no clue how it affects the currently loaded document). You can do this through the "Lua Console", e.g.:
editor.CodePage = SC_CP_UTF8
-- or something like
editor.CodePage = 936
Hi @dail8859 ,
I must admit that my previous conclusion was wrong. If you delete emit aboutToSave(); everything works fine.
After a more detailed investigation, I found that the problem is inside the function trimTrailingWhitespace. We can do a simple test:
diff --git a/src/NotepadNext/Finder.cpp b/src/NotepadNext/Finder.cpp
index a8984f8..9b63527 100644
--- a/src/NotepadNext/Finder.cpp
+++ b/src/NotepadNext/Finder.cpp
@@ -151,6 +151,7 @@ int Finder::replaceAll(const QString &replaceText)
total++;
editor->setTargetRange(start, end);
+ qDebug() << "Target byte is " << editor->targetText().toHex();
if (isRegex)
return start + editor->replaceTargetRE(replaceData.length(), replaceData.constData());
else
Application output:
[ 10.174] I: Scintilla::Internal::RegexSearchBase* Scintilla::Internal::CreateRegexSearch(CharClassify*)
[ 10.174] D: Target byte is "ef"
[ 10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[ 10.174] D: Target byte is "bd"
[ 10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[ 10.174] D: Target byte is "bf"
[ 10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[ 10.174] D: Target byte is "e6"
[ 10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[ 10.174] D: Target byte is "20"
[ 10.174] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[ 10.175] D: Target byte is "20"
[ 10.175] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
[ 10.175] D: Target byte is "27"
[ 10.175] I: virtual const char* QRegexSearch::SubstituteByPosition(Scintilla::Internal::Document*, const char*, Sci::Position*)
So, it is clear that the function trimTrailingWhitespace replaces these bytes with nothing, causing them to go missing.
I'm not sure if this is correct, but it looks like SCI_FINDTEXT and SCI_REPLACETARGET are based on different units.
Concretely, SCI_FINDTEXT appears to use "character" as its unit, that is to say any Unicode character such as "啊" is treated as one unit. But SCI_REPLACETARGET appears to use "byte" as its unit, which means "啊" is three units.
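The unit mismatch is easy to see on a single character. The sketch below, in standard C++, compares the byte length of a UTF-8 string with its code point count; `countCodePoints` is a hypothetical helper, not a Scintilla or Qt function, and it assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by counting every byte that
// is NOT a continuation byte (continuation bytes look like 10xxxxxx).
// Assumes `utf8` is valid UTF-8. Hypothetical helper for illustration.
std::size_t countCodePoints(const std::string &utf8) {
    std::size_t count = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "啊" (bytes `e5 95 8a`), `std::string::size()` reports 3 while `countCodePoints` reports 1, so a position computed in one unit and consumed in the other lands in the middle of a multi-byte sequence.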
Update:
Now I bet this must be the reason. SCI_FINDTEXT is based on QRegularExpression, and QRegularExpression does indeed work in units of characters.
Unfortunately, I don't see QRegularExpression or std::regex providing any interface based on byte position, nor do I see Scintilla providing any interface based on character position.
Without changing the framework, we can do something like:
diff --git a/src/NotepadNext/QRegexSearch.cpp b/src/NotepadNext/QRegexSearch.cpp
index 7c0ffd2..dec6287 100644
--- a/src/NotepadNext/QRegexSearch.cpp
+++ b/src/NotepadNext/QRegexSearch.cpp
@@ -68,17 +68,25 @@ Sci::Position QRegexSearch::FindText(Document *doc, Sci::Position minPos, Sci::P
// Only need the first maxPos characters
QByteArray view = QByteArray::fromRawData(doc->BufferPointer(), maxPos);
+ QString stringView = QString::fromUtf8(view);
+
+ // We should also consider other multibyte encodings, so `fromUtf8` is not the final solution
+ minPos = QString::fromUtf8(view.mid(0, minPos)).length();
// Start at minPos, this keeps the position match inline with Scintilla since it thinks it starts at the beginning
- QRegularExpressionMatch m = re.match(view, minPos, QRegularExpression::NormalMatch, QRegularExpression::NoMatchOption);
+ QRegularExpressionMatch m = re.match(stringView, minPos, QRegularExpression::NormalMatch, QRegularExpression::NoMatchOption);
if (!m.hasMatch())
return -1; // No match
match = m;
*length = match.capturedLength(0);
+ // We should also consider other multibyte encodings, so `toUtf8` is not the final solution
+ qsizetype start = stringView.mid(0, match.capturedStart(0)).toUtf8().length();
+ *length = match.captured(0).toUtf8().length();
- return match.capturedStart(0);
+ return start;
}
Since QString::length returns the count of characters (strictly, UTF-16 code units) and QByteArray::length returns the count of bytes, we can glue QRegularExpression and Scintilla together. However, I cannot evaluate the time cost and space cost of this patch.
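The offset translation in the patch can also be expressed without building intermediate strings. The sketch below, in standard C++, converts between UTF-8 byte offsets (the unit Scintilla uses for a UTF-8 document) and UTF-16 code unit offsets (the unit QString uses). All function names are hypothetical, the input is assumed to be valid UTF-8, and offsets are assumed to fall on character boundaries; this is not the approach taken in the eventual fix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with lead byte `b`.
// Assumes `b` is a valid lead byte (hypothetical helper).
static std::size_t sequenceLength(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

// Number of UTF-16 code units encoded by the first `bytePos` bytes.
// A 4-byte UTF-8 sequence becomes a surrogate pair, i.e. 2 code units.
std::size_t byteToUtf16Offset(const std::string &utf8, std::size_t bytePos) {
    assert(bytePos <= utf8.size());
    std::size_t units = 0;
    for (std::size_t i = 0; i < bytePos;) {
        const std::size_t len = sequenceLength(static_cast<unsigned char>(utf8[i]));
        units += (len == 4) ? 2 : 1;
        i += len;
    }
    return units;
}

// Byte offset that corresponds to UTF-16 code unit offset `unitPos`.
// Stops at the end of the string if `unitPos` is past the end.
std::size_t utf16ToByteOffset(const std::string &utf8, std::size_t unitPos) {
    std::size_t units = 0;
    std::size_t i = 0;
    while (i < utf8.size() && units < unitPos) {
        const std::size_t len = sequenceLength(static_cast<unsigned char>(utf8[i]));
        units += (len == 4) ? 2 : 1;
        i += len;
    }
    return i;
}
```

Each conversion is a linear scan from the start of the buffer, so calling it once per match in a replace-all loop over a large document could become quadratic. Caching the last converted position would be one way to reduce that, at the cost of extra bookkeeping.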
So, it is clear that the function trimTrailingWhitespace replaces these bytes with null causing them to be missing.
@NichtsHsu Awesome!! Thanks for taking the time to troubleshoot this and track down the root cause.
I personally haven't given much attention to multibyte characters, but I foresee this kind of thing happening frequently now that more non-English users are using the application.
Unfortunately, I don't see QRegularExpression and std::regex provide any interface based on byte position, nor do I see Scintilla provide any interface based on character position.
I'm not sure either, I'd have to do some investigating on it as well.
However, I cannot evaluate the time cost and space cost of this patch.
I'm actually glad to see this comment, because it means you are thinking of the worst-case scenarios rather than just "this works so leave it." The code is incredibly useful as it starts to point to possible solutions :)
So a bit of a side note for anyone interested: if for some reason QRegularExpression poses too much of a problem or burden, then other solutions might have to be evaluated. I know Notepad++ uses Boost, but I really don't want to rely on that at all, especially when Qt has its own built-in regular expression support.
@NichtsHsu
I think I finally got a solution.
https://github.com/dail8859/NotepadNext/commit/388439eab918af8477510141498544595b1ed4d7
Regex searching in general is working pretty smoothly as far as I can tell. Not sure how efficient it is (still might be doing an implicit conversion to QString?) but this is a step in the right direction.
Thanks for your wonderful work, it works fine in my documents.