notepad4 regex character range with multi-byte characters

Open Jerkwin opened this issue 3 years ago • 7 comments

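Notepad2's built-in regular expression engine is rather slow and has problems handling Chinese. Notepad3 uses the Oniguruma regex engine, which works better. Could you borrow from it?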

Mar 02 '21 05:03 Jerkwin

There are currently no plans to use a third-party regular expression engine (Oniguruma, Boost, re2, pcre2, etc.).

Mar 05 '21 13:03 zufuliu

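In that case, could you fix the problem that regex backreferences such as \1 do not support Chinese characters? With UTF-8 encoding, ([一-龟]) can be used to capture a single Chinese character, but replacing it with \1 produces garbled text.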

Mar 27 '21 03:03 Jerkwin
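
To illustrate how a range like [一-龟] can end up capturing only part of a character, here is a minimal sketch. It assumes a hypothetical byte-oriented engine that reads the UTF-8 bytes of the range endpoints one at a time; it is not Scintilla's actual code. `EncodeUtf8Bmp3`, `NaiveByteClass`, and `NaiveMatchLength` are names made up for this example.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Encode one code point in U+0800..U+FFFF (no surrogates) as 3 UTF-8 bytes.
std::string EncodeUtf8Bmp3(char32_t cp) {
    std::string out;
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return out;
}

// ASSUMPTION: a byte-oriented parser sees "[" lo-bytes "-" hi-bytes "]".
// Every byte becomes a literal set member, and the byte just before '-'
// and the byte just after it form the range.
std::bitset<256> NaiveByteClass(const std::string &lo, const std::string &hi) {
    assert(!lo.empty() && !hi.empty());
    std::bitset<256> set;
    for (unsigned char b : lo) set.set(b);
    for (unsigned char b : hi) set.set(b);
    const unsigned from = static_cast<unsigned char>(lo.back());
    const unsigned to = static_cast<unsigned char>(hi.front());
    for (unsigned b = from; b <= to; ++b) set.set(b);
    return set;
}

// A byte-oriented class consumes at most ONE byte, so a capture group
// around it holds a fragment of a 3-byte character.
std::size_t NaiveMatchLength(const std::bitset<256> &set, const std::string &text) {
    if (text.empty()) return 0;
    return set.test(static_cast<unsigned char>(text[0])) ? 1 : 0;
}
```

Under that assumption, 一 (U+4E00) is the bytes E4 B8 80 and 龟 (U+9F9F) is E9 BE 9F, so the class becomes "any byte in 0x80..0xE9". Matching 中 (E4 B8 AD) then consumes only the lead byte E4, and writing that lone byte back through \1 yields invalid UTF-8.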

I have not used \n (backreference) replacement before; I will look into it when I have time.

Mar 27 '21 14:03 zufuliu

Fixed by 7d7b4d08995ed992538cf46ba676310f86933003, reported to upstream at https://sourceforge.net/p/scintilla/bugs/2244/

Apr 10 '21 02:04 zufuliu

It's a nice issue. BTW, I think you'd better write in English.

Apr 10 '21 09:04 bluenlive

See Neil's comment at https://sourceforge.net/p/scintilla/bugs/2244/#f523: this "works" by accident and does not do what is intended (match any character in the U+4E00..U+9FFF CJK Unified Ideographs block).

The proper fix is to enable CXX11_REGEX (#undef NO_CXX11_REGEX in Document.cxx, or remove NO_CXX11_REGEX from the Notepad2 project files) and use the SCFIND_CXX11REGEX flag, which will use std::wregex for UTF-8.

Enabling CXX11_REGEX will increase the binary size by 164 KiB.

Apr 11 '21 08:04 zufuliu
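
For comparison, a minimal sketch of the std::wregex approach using only the standard library. It does not show how Scintilla wires up SCFIND_CXX11REGEX internally. `DecodeUtf8Bmp` and `WrapCjk` are hypothetical helpers, and the decoder assumes well-formed input and only handles sequences of up to three bytes (the BMP).

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Decode UTF-8 into wide characters (BMP only; anything else becomes U+FFFD).
std::wstring DecodeUtf8Bmp(const std::string &s) {
    std::wstring out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) {
            out += static_cast<wchar_t>(b0);
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < s.size()) {
            const unsigned b1 = static_cast<unsigned char>(s[i + 1]);
            out += static_cast<wchar_t>(((b0 & 0x1F) << 6) | (b1 & 0x3F));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < s.size()) {
            const unsigned b1 = static_cast<unsigned char>(s[i + 1]);
            const unsigned b2 = static_cast<unsigned char>(s[i + 2]);
            out += static_cast<wchar_t>(((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F));
            i += 3;
        } else {
            out += static_cast<wchar_t>(0xFFFD);  // unsupported or truncated sequence
            i += 1;
        }
    }
    return out;
}

// The range now compares whole code units, so the capture group holds a
// complete character and $1 writes it back intact.
std::wstring WrapCjk(const std::wstring &text) {
    static const std::wregex re(L"([\u4e00-\u9f9f])");  // 一 .. 龟
    return std::regex_replace(text, re, L"<$1>");
}
```

For example, `WrapCjk(DecodeUtf8Bmp("a\xE4\xB8\xAD" "b"))` wraps the single character 中 (U+4E2D) in angle brackets and leaves the ASCII letters alone.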

[一-龟] does not work perfectly. It matches most of the Chinese characters that are frequently used in daily life. This is just a trick, and it is easy to remember.

Enabling CXX11_REGEX is a good solution; binary size does not matter much nowadays.

Apr 11 '21 14:04 Jerkwin
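
A small sketch of what "does not work perfectly" means in code point terms, assuming the range is interpreted per character rather than per byte. The function and constant names are made up for this example.

```cpp
// Inclusive bounds taken from the thread.
constexpr char32_t kTrickFirst = 0x4E00;  // 一
constexpr char32_t kTrickLast = 0x9F9F;   // 龟
constexpr char32_t kBlockLast = 0x9FFF;   // end of the CJK Unified Ideographs block

// True if cp falls inside the [一-龟] range.
constexpr bool InTrickRange(char32_t cp) {
    return cp >= kTrickFirst && cp <= kTrickLast;
}

// True if cp falls inside the U+4E00..U+9FFF block.
constexpr bool InCjkUnifiedBlock(char32_t cp) {
    return cp >= kTrickFirst && cp <= kBlockLast;
}

static_assert(InTrickRange(0x4E2D), "U+4E2D is covered by the trick");
static_assert(!InTrickRange(0x9FA5) && InCjkUnifiedBlock(0x9FA5),
              "U+9FA5 is in the block but past the trick range");
static_assert(!InTrickRange(0x3400), "Extension A (U+3400..) is outside both");
static_assert(kBlockLast - kTrickLast == 96, "96 block code points lie past U+9F9F");
```

So the trick covers everything in the block up to 龟 and skips the last 96 code points of the block as well as the extension blocks.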
