compile-time-regular-expressions
ctre gives a different result compared with ICU and Rust
Here is the test code:
#include <cstdint>
#include <string>
#include <string_view>
#include <utility>
#include <vector>
#include <ctre.hpp>

int test2()
{
    using namespace std::literals;
    //std::string original = "𝔾𝕠𝕠𝕕 𝕞𝕠𝕣𝕟𝕚𝕟𝕘 𝔾𝕠𝕠𝕕 𝕞𝕠𝕣𝕟𝕚𝕟𝕘";
    std::string original = "戦場のヴァルキュリア3";
    auto bdata = original.data();
    static constexpr auto pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
    auto matcher = ctre::search<pattern>;
    std::string_view cur_data((char*)original.data(), original.size());
    // Each entry is ((start_byte, end_byte), is_match).
    std::vector<std::pair<std::pair<int32_t, int32_t>, bool>> splits;
    splits.reserve(original.size());
    int prev = 0;
    bool is_matched = false;
    do {
        auto matched = matcher(cur_data);
        is_matched = matched;
        if (is_matched) {
            // Byte offsets of the match relative to the whole input.
            auto start_byte_index = matched.begin() - original.data();
            auto end_byte_index = matched.end() - original.data();
            // Record the unmatched gap before this match, if there is one.
            if (prev != start_byte_index) {
                std::pair<int32_t, int32_t> p(prev, start_byte_index);
                splits.push_back(
                    std::pair<std::pair<int32_t, int32_t>, bool>(p, false));
            }
            std::pair<int32_t, int32_t> p(start_byte_index, end_byte_index);
            splits.push_back(
                std::pair<std::pair<int32_t, int32_t>, bool>(p, true));
            prev = end_byte_index;
            // Continue searching after the end of this match.
            int pos = matched.end() - cur_data.data();
            cur_data.remove_prefix(pos);
        }
    } while (is_matched);
    return 0;
}
Rust and ICU give the same result: the matched string is "戦場のヴァルキュリア3". ctre gives two parts, "戦場のヴァルキュリア" and "3". Why does that happen?
Can you minimize it?
Yes!
int test2()
{
    std::string original = "戦場のヴァルキュリア3";
    int size_of_str = original.size(); // size_of_str = 31 (bytes, UTF-8)
    auto bdata = original.data();
    static constexpr auto pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
    auto matcher = ctre::search<pattern>;
    std::string_view cur_data((char*)original.data(), original.size());
    int prev = 0;
    bool is_matched = false;
    do {
        auto matched = matcher(cur_data);
        is_matched = matched;
        if (is_matched) {
            // Continue searching after the end of this match.
            int pos = matched.end() - cur_data.data();
            cur_data.remove_prefix(pos);
        }
    } while (is_matched);
    return 0;
}
The code gives me two matches: one is "戦場のヴァルキュリア" and the other is "3".
But when I run the same regex search with the ICU library and with Rust, they give me one match: "戦場のヴァルキュリア3". Why does that happen?
By the way, if I use the string std::string original = "Media.Vision"; then ctre, the ICU library, and Rust all give the same three matches:
- "Media"
- "."
- "Vision"
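The byte count of 31 in the minimized example, and the byte offsets used in the test code, follow from the UTF-8 encoding: each Japanese character in this string takes 3 bytes and "3" takes 1. A minimal sketch of that arithmetic, assuming valid UTF-8 input (count_code_points is a hypothetical helper, not part of ctre):

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// counting every byte that is NOT a continuation byte (10xxxxxx).
// Assumes the input is valid UTF-8.
constexpr std::size_t count_code_points(std::string_view utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With a UTF-8 encoded source file, the string from the issue has 11 code points spread over 31 bytes, which is why match positions reported as byte indices do not line up with character positions.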
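\w+ in Rust is Unicode-aware: it will match any word character in any script (roughly equivalent to [\p{L}\p{N}_]).
In PCRE, by default, it matches only ASCII letters, digits, and underscore. With that ASCII-only definition, the Japanese characters fall into [^\w\s]+ and only "3" matches \w+, which gives the two parts you see.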
https://regex101.com/r/jVmHsw/1
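To make the difference concrete, here is a hand-written sketch that emulates `\w+|[^\w\s]+` with ASCII-only `\w` and `\s`, working byte by byte. It does not use ctre; split_ascii_only and the two predicates are hypothetical helpers, and the assumption is simply that every non-ASCII byte is neither a word character nor whitespace.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// ASCII-only notion of \w: letters, digits and underscore.
constexpr bool is_ascii_word(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// ASCII-only notion of \s: space, \t, \n, \v, \f, \r.
constexpr bool is_ascii_space(unsigned char c) {
    return c == ' ' || (c >= '\t' && c <= '\r');
}

// Hypothetical helper: splits `text` into maximal runs of word bytes and
// maximal runs of bytes that are neither word bytes nor whitespace.
std::vector<std::string> split_ascii_only(std::string_view text) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (is_ascii_space(c)) {  // whitespace matches neither alternative
            ++i;
            continue;
        }
        const bool word = is_ascii_word(c);
        std::size_t j = i;
        while (j < text.size()) {
            const unsigned char d = static_cast<unsigned char>(text[j]);
            if (is_ascii_space(d) || is_ascii_word(d) != word) {
                break;
            }
            ++j;
        }
        tokens.emplace_back(text.substr(i, j - i));
        i = j;
    }
    return tokens;
}
```

Under these assumptions, "戦場のヴァルキュリア3" splits into a 30-byte run of non-word bytes followed by "3", while "Media.Vision" splits into "Media", "." and "Vision", matching the results reported above.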
For a compile-time regex library to be fully Unicode-aware is a huge ask, FYI @DamonsJ. Unicode is incredibly complex, requiring lots of very large lookup-tables and other short-circuiting mechanisms to implement all the code point identification logic correctly and efficiently.
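As a rough illustration of the lookup-table point, the sketch below classifies a code point with a binary search over a sorted range table. The table holds only three Unicode blocks chosen for this example (blocks are a coarse stand-in, not the real \p{L} data), and all names are hypothetical; a real implementation would need generated tables for every property across the whole code space.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct CodePointRange {
    std::uint32_t first;
    std::uint32_t last;  // inclusive
};

// Illustrative subset only: three Unicode blocks, sorted by first code point.
inline constexpr std::array<CodePointRange, 3> kSampleRanges{{
    {0x3040, 0x309F},  // Hiragana block
    {0x30A0, 0x30FF},  // Katakana block
    {0x4E00, 0x9FFF},  // CJK Unified Ideographs block
}};

// Binary search over the sorted, non-overlapping ranges.
constexpr bool in_sample_ranges(std::uint32_t cp) {
    std::size_t lo = 0;
    std::size_t hi = kSampleRanges.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (cp < kSampleRanges[mid].first) {
            hi = mid;
        } else if (cp > kSampleRanges[mid].last) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

// Usable at compile time, which is the setting a library like ctre works in.
static_assert(in_sample_ranges(0x306E));  // U+306E, Hiragana "no"
static_assert(!in_sample_ranges(0x0033)); // U+0033, ASCII digit 3
```

Even this toy version shows the cost: every property needs its own table, and each table has to be embedded in the program and searched during matching.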
Thanks very much, @marzer and @iulian-rusu.
I know it is hard to fully support Unicode in a regex library!
For my problem, I wrote the pattern like this:
static constexpr auto pattern = ctll::fixed_string{
"[\\p{L}\\p{N}\\p{M}\\p{Pc}]+|[^\\p{L}\\p{N}\\p{M}\\p{Pc}\\p{Zs}\\u{A}\\u{B}\\u{C}\\u{D}"
"\\u{85}\\u{2028}\\u{2029}\\u{DA}]+"};
It works for me, although it is not exactly the same as:
static constexpr auto pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
I hope this helps others who have the same problem!