
regex character range with multi-byte characters

Open Jerkwin opened this issue 3 years ago • 7 comments

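notepad2's built-in regular expression engine is slow and has problems handling Chinese text. notepad3 uses the Oniguruma regex engine, which works better. Could you borrow from it?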

Jerkwin avatar Mar 02 '21 05:03 Jerkwin

There are currently no plans to use a third-party regular expression engine (Oniguruma, Boost, re2, pcre2, etc.).

zufuliu avatar Mar 05 '21 13:03 zufuliu

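In that case, could you fix the problem that backreferences such as \1 do not support Chinese characters? With UTF-8 encoding, ([一-龟]) can be used to capture a single Chinese character, but if it is then replaced with \1, the output is garbled.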

Jerkwin avatar Mar 27 '21 03:03 Jerkwin

I have not used \n replacement before; I will look into it when I have time.

zufuliu avatar Mar 27 '21 14:03 zufuliu

Fixed by 7d7b4d08995ed992538cf46ba676310f86933003, reported to upstream at https://sourceforge.net/p/scintilla/bugs/2244/

zufuliu avatar Apr 10 '21 02:04 zufuliu

It's a nice issue. BTW, I think you'd better write in English.

bluenlive avatar Apr 10 '21 09:04 bluenlive

See Neil's comment at https://sourceforge.net/p/scintilla/bugs/2244/#f523: this "works" by accident and does not do what is intended (match any character in the U+4E00..U+9FFF CJK Unified Ideographs block).
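To illustrate why a byte-oriented character class can only "work" by accident on UTF-8 text, here is a minimal Python sketch. It is not Scintilla's code: it assumes the built-in engine parses the class one byte at a time, and uses Python's `re` with a bytes pattern as a stand-in. The helper names (`byte_matches`, `wrap_bytes`, `wrap_codepoints`) are made up for this example.

```python
import re

LO = "一".encode("utf-8")  # U+4E00 -> b'\xe4\xb8\x80'
HI = "龟".encode("utf-8")  # U+9F9F -> b'\xe9\xbe\x9f'

# Assumed model of a byte-oriented engine: the class is parsed byte by byte,
# so it contains E4, B8, the range 80-E9, BE and 9F. In effect it matches
# any single byte in 0x80..0xE9, not a whole character.
BYTE_CLASS = re.compile(b"([" + LO + b"-" + HI + b"])")

# Code-point view: one class item spanning the whole U+4E00..U+9FFF block.
CODEPOINT_CLASS = re.compile("([\u4e00-\u9fff])")


def byte_matches(text):
    """Return what the byte-wise class matches in the UTF-8 form of text."""
    return BYTE_CLASS.findall(text.encode("utf-8"))


def wrap_bytes(text):
    """Replace each byte-wise match with <\\1>, splitting multi-byte characters."""
    return BYTE_CLASS.sub(rb"<\1>", text.encode("utf-8"))


def wrap_codepoints(text):
    """Replace each code-point match with <\\1>, keeping characters intact."""
    return CODEPOINT_CLASS.sub(r"<\1>", text)


if __name__ == "__main__":
    print(byte_matches("中"))        # three separate one-byte matches
    print(byte_matches("é"))         # a non-CJK character matches too
    print(wrap_bytes("中"))          # no longer valid UTF-8
    print(wrap_codepoints("a中é"))   # only the CJK character is wrapped
```

Under this model the capture group holds a single byte of a three-byte character, which is consistent with the garbled output reported for \1 replacement.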

The proper fix is to enable CXX11_REGEX (#undef NO_CXX11_REGEX in Document.cxx, or remove NO_CXX11_REGEX from the Notepad2 project files) and use the SCFIND_CXX11REGEX flag, which will use std::wregex for UTF-8.

Enabling CXX11_REGEX will increase the binary size by 164 KiB.

zufuliu avatar Apr 11 '21 08:04 zufuliu

[一-龟] does not work perfectly. It matches most of the Chinese characters that are frequently used in daily life. It is just a trick, and it is easy to remember.
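As a rough check of how far the shortcut falls short, the sketch below compares [一-龟] with the full U+4E00..U+9FFF block at the code-point level. It uses Python's `re` on `str`, so it only describes an engine that treats the class as a code-point range; the helper names are hypothetical.

```python
import re

SHORTCUT = re.compile("[一-龟]")        # U+4E00..U+9F9F
BLOCK = re.compile("[\u4e00-\u9fff]")   # whole CJK Unified Ideographs block


def in_shortcut(ch):
    """True if the single character ch falls inside [一-龟]."""
    return SHORTCUT.fullmatch(ch) is not None


def in_block(ch):
    """True if the single character ch falls inside U+4E00..U+9FFF."""
    return BLOCK.fullmatch(ch) is not None


if __name__ == "__main__":
    for ch in ("中", "\u9fa0", "\u3400"):
        print(f"U+{ord(ch):04X}", in_shortcut(ch), in_block(ch))
```

U+9FA0 sits just past 龟 (U+9F9F), so the shortcut misses it while the block range includes it. U+3400 belongs to CJK Extension A and is outside both ranges.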

Enabling CXX11_REGEX is a good solution; binary size does not matter much nowadays.

Jerkwin avatar Apr 11 '21 14:04 Jerkwin