notepad4
regex character range with multi-byte characters
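The regular expression engine built into notepad2 is rather slow and has problems handling Chinese. notepad3 uses the Oniguruma regex engine, which works better. Could that approach be borrowed?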
There is currently no plan to use a third-party regular expression engine (Oniguruma, Boost, re2, pcre2, etc.).
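In that case, could you fix the problem that regex backreferences such as `\1` do not support Chinese characters? With UTF-8 encoding, `([一-龟])` can be used to capture a single Chinese character, but replacing it with `\1` produces garbled text.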
I have not used `\n` (backreference) replacement before; I will look into it when I have time.
Fixed by 7d7b4d08995ed992538cf46ba676310f86933003, reported to upstream at https://sourceforge.net/p/scintilla/bugs/2244/
It's a nice issue. BTW, I think you'd better write in English.
See Neil's comment at https://sourceforge.net/p/scintilla/bugs/2244/#f523: this "works" by accident and does not do what it is intended to do (match any character in the U+4E00..U+9FFF CJK Unified Ideographs block).
The proper fix is to enable CXX11_REGEX (`#undef NO_CXX11_REGEX` in Document.cxx, or remove NO_CXX11_REGEX from the Notepad2 project files) and use the SCFIND_CXX11REGEX flag, which uses std::wregex for UTF-8.
Enabling CXX11_REGEX will increase the binary size by 164 KiB.
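For illustration only, here is a minimal sketch of what matching on whole characters looks like with std::wregex used directly from the C++ standard library. It is not Scintilla's or Notepad2's code: the helper name `BracketCjk` is hypothetical, and the input is assumed to be a wide string already, so no UTF-8 conversion is shown. Because the range U+4E00..U+9FFF lies in the Basic Multilingual Plane, each character fits in a single `wchar_t` on both 16-bit and 32-bit `wchar_t` platforms.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Scintilla or Notepad2): wraps every
// character in the CJK Unified Ideographs block U+4E00..U+9FFF in square
// brackets, using the backreference $1 in the replacement text.
std::wstring BracketCjk(const std::wstring &text) {
    // The range endpoints are universal character names inside a wide
    // literal, so the engine compares whole wchar_t values, not UTF-8 bytes.
    static const std::wregex cjk(L"([\u4e00-\u9fff])");
    // "$1" is the ECMAScript-style reference to the first capture group.
    return std::regex_replace(text, cjk, L"[$1]");
}
```

Calling `BracketCjk` on a string that mixes ASCII and Chinese text should leave the ASCII characters untouched and bracket each ideograph individually, since the capture group always holds one complete character.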
`[一-龟]` does not work perfectly. It matches most of the Chinese characters that are frequently used in daily life. This is just a trick, and it is easy to remember.
Enabling CXX11_REGEX is a good solution; binary size does not matter much nowadays.