compile-time-regular-expressions

ctre gives a different result from ICU and Rust

Open DamonsJ opened this issue 1 year ago • 6 comments

Here is the test code:

int test2()
{
    using namespace std::literals;
    //std::string original = "𝔾𝕠𝕠𝕕 𝕞𝕠𝕣𝕟𝕚𝕟𝕘 𝔾𝕠𝕠𝕕 𝕞𝕠𝕣𝕟𝕚𝕟𝕘";
    std::string original = "戦場のヴァルキュリア3";
    auto bdata = original.data();
    static constexpr auto  pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
    auto matcher = ctre::search<pattern>;
    
    std::string_view cur_data((char*)original.data(),original.size());
    
    std::vector<std::pair<std::pair<int32_t, int32_t>, bool>> splits;
    splits.reserve(original.size());
    int prev = 0;
    bool is_matched =false;
    do {
        auto matched = matcher(cur_data);
        is_matched = matched;
        if (is_matched){
            
            auto start_byte_index =  matched.begin() - original.data();
            auto end_byte_index =  matched.end() - original.data();
            
            
            if (prev != start_byte_index) {
                std::pair<int32_t, int32_t> p(prev, start_byte_index);
                splits.push_back(
                                 std::pair<std::pair<int32_t, int32_t>, bool>(p, false));
            }
            std::pair<int32_t, int32_t> p(start_byte_index, end_byte_index);
            splits.push_back(std::pair<std::pair<int32_t, int32_t>, bool>(p,
                                                                          true));
            prev = end_byte_index;
            int pos = matched.end() - cur_data.data();
            cur_data.remove_prefix(pos);
        }
    } while(is_matched);
}

Rust and ICU give the same result: a single match, "戦場のヴァルキュリア3". ctre gives two parts, "戦場のヴァルキュリア" and "3". Why does that happen?

DamonsJ, Jun 26 '23 02:06

Can you minimize it?

hanickadot, Jun 26 '23 04:06

Yes!

int test2()
{
   
    std::string original = "戦場のヴァルキュリア3";
    int size_of_str = original.size(); // size_of_str = 31;
    auto bdata = original.data();
    static constexpr auto  pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
    auto matcher = ctre::search<pattern>;
    std::string_view cur_data((char*)original.data(),original.size());

    int prev = 0;
    bool is_matched =false;
    do {
        auto matched = matcher(cur_data);
        is_matched = matched;
        if (is_matched){
            int pos = matched.end() - cur_data.data();
            cur_data.remove_prefix(pos);
        }
    } while(is_matched);
}

This code gives me two matches: one is "戦場のヴァルキュリア" and the other is "3".

But when I run the same regex search with the ICU library and with Rust, both give me a single match: "戦場のヴァルキュリア3". Why does that happen?

DamonsJ, Jun 26 '23 05:06

By the way, if I use this string: std::string original = "Media.Vision"; then ctre, the ICU library and Rust all give the same three matches:

  1. "Media"
  2. "."
  3. "Vision"

DamonsJ, Jun 26 '23 06:06

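\w+ in Rust is Unicode-aware: it matches word characters in any script (roughly equivalent to [\p{L}\p{N}_]). In PCRE it matches only ASCII letters, digits and underscore by default. With an ASCII-only \w, the Japanese characters can only be matched by [^\w\s]+, and "3" is then matched separately by \w+.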

https://regex101.com/r/jVmHsw/1
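
A minimal sketch of why the split happens, using only the standard library rather than ctre itself. It assumes the input is UTF-8 and that matching proceeds byte by byte with an ASCII-only \w and \s. The names ascii_split and check_examples are hypothetical helpers for illustration, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// ASCII-only \w: letters, digits and underscore.
constexpr bool is_ascii_word(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// ASCII-only \s: space, \t, \n, \v, \f, \r.
constexpr bool is_ascii_space(unsigned char c) {
    return c == ' ' || (c >= '\t' && c <= '\r');
}

enum class Kind { Word, Space, Other };

constexpr Kind classify(unsigned char c) {
    if (is_ascii_word(c)) return Kind::Word;
    if (is_ascii_space(c)) return Kind::Space;
    return Kind::Other;  // includes every byte of a multi-byte UTF-8 sequence
}

// Hypothetical helper: emulates repeated searches for \w+|[^\w\s]+ over raw
// bytes by emitting maximal runs of word bytes and of "other" bytes.
std::vector<std::string> ascii_split(std::string_view text) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < text.size()) {
        const Kind k = classify(static_cast<unsigned char>(text[i]));
        std::size_t j = i + 1;
        while (j < text.size() &&
               classify(static_cast<unsigned char>(text[j])) == k) {
            ++j;
        }
        if (k != Kind::Space) out.emplace_back(text.substr(i, j - i));
        i = j;
    }
    return out;
}

// Reproduces the two cases from this thread (source file assumed to be UTF-8).
void check_examples() {
    const auto jp = ascii_split("戦場のヴァルキュリア3");
    assert(jp.size() == 2);
    assert(jp[0] == "戦場のヴァルキュリア");  // 30 non-ASCII bytes: [^\w\s]+
    assert(jp[1] == "3");                      // one ASCII digit: \w+

    const auto en = ascii_split("Media.Vision");
    assert(en.size() == 3);
    assert(en[0] == "Media" && en[1] == "." && en[2] == "Vision");
}
```

Every byte of the Japanese text is 0x80 or above, so none of them counts as an ASCII word character, while "Media.Vision" is pure ASCII and therefore splits the same way in all three engines.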

iulian-rusu, Jun 28 '23 20:06

For a compile-time regex library to be fully Unicode-aware is a huge ask, FYI @DamonsJ. Unicode is incredibly complex, requiring lots of very large lookup-tables and other short-circuiting mechanisms to implement all the code point identification logic correctly and efficiently.
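
To give a rough idea of what such a lookup table looks like, here is a small sketch of a sorted range table with a constexpr binary search. The table is a tiny hand-picked subset chosen for illustration only and is not real \p{L} data; Range, kSampleLetterRanges, in_ranges and is_sample_letter are hypothetical names, not ctre internals.

```cpp
#include <cstddef>
#include <cstdint>

struct Range {
    std::uint32_t first;
    std::uint32_t last;
};

// Illustrative subset only, sorted by code point. A real implementation of
// \p{L} needs many hundreds of ranges generated from the Unicode database.
constexpr Range kSampleLetterRanges[] = {
    {0x0041, 0x005A},  // A-Z
    {0x0061, 0x007A},  // a-z
    {0x3041, 0x3096},  // Hiragana letters
    {0x30A1, 0x30FA},  // Katakana letters
    {0x4E00, 0x9FFF},  // CJK Unified Ideographs block
};

// Binary search over sorted, non-overlapping ranges.
constexpr bool in_ranges(std::uint32_t cp, const Range* table, std::size_t n) {
    std::size_t lo = 0;
    std::size_t hi = n;
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (cp < table[mid].first) {
            hi = mid;
        } else if (cp > table[mid].last) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

constexpr bool is_sample_letter(std::uint32_t cp) {
    return in_ranges(cp, kSampleLetterRanges,
                     sizeof(kSampleLetterRanges) / sizeof(kSampleLetterRanges[0]));
}

// Evaluated at compile time (source file assumed to be UTF-8).
static_assert(is_sample_letter(U'戦'));
static_assert(is_sample_letter(U'の'));
static_assert(is_sample_letter(U'ヴ'));
static_assert(!is_sample_letter(U'3'));
```

Even this toy version shows the cost: every property needs its own table, and the complete tables have to be embedded in the header and searched during constant evaluation.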

marzer, Jun 28 '23 22:06

Thanks very much, @marzer and @iulian-rusu.

I know it is hard to fully support Unicode in a regex library!

To solve my problem, I wrote the pattern like this:

static constexpr auto pattern = ctll::fixed_string{
        "[\\p{L}\\p{N}\\p{M}\\p{Pc}]+|[^\\p{L}\\p{N}\\p{M}\\p{Pc}\\p{Zs}\\u{A}\\u{B}\\u{C}\\u{D}"
        "\\u{85}\\u{2028}\\u{2029}\\u{DA}]+"};

It works for me, although it is not exactly the same as:

static constexpr auto  pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};

I hope this helps others who have the same problem!

DamonsJ, Jun 29 '23 02:06