compile-time-regular-expressions Unexpected behavior with a bullet point character `•`


Say, I have a string

    char8_t text[] = u8"• test\n - two\n ••    three\n-• four\n";

I would like to substitute any number of consecutive blank characters, -, or • with just a single space. I tried the following:

    char8_t* b = text + std::size(text)-1;
    for (char8_t* r = text;;) {
      auto m = ctre::search<u8R"([\s\-•]+)">(r,b);
      if (!m) break;
      char8_t* w = m.begin();
      r = m.end();
      if (r==b) {
        b = w;
        break;
      }
      *w++ = ' ';
      if (w!=r) {
        memmove(w,r,b-r);
        b -= r-w;
        r = w;
      }
    }
    *b = '\0';

    cout << ((char*)text) << endl;

But this results in

    • test two •• three • four

I'm including <ctre-unicode.hpp>.

Is this a bug or the intended behavior?

At first, I thought the problem might be putting a • inside [], because perhaps [] only accepts single-byte characters and escape sequences, but I get the same output with (?:[\s\-]|•)+ as with the original [\s\-•]+. And \P{L}+ results in what I assume is the removal of only some of the bytes that make up the • characters:

    � test two � � three � four

Here's a godbolt link.

Mar 15 '22 15:03 ivankp
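For readers following along: • is U+2022, which UTF-8 encodes as the three bytes E2 80 A2. The sketch below only prints those bytes; `hex_bytes` is a hypothetical helper, not part of CTRE, and it uses plain `char` so it compiles before C++20. If the matcher walks the input one byte at a time and treats each byte as its own code point (an assumption about what happens here, not something confirmed in this thread), then no single byte equals U+2022, so • never matches `[\s\-•]+`. Under the same assumption, `\P{L}+` would keep 0xE2 (U+00E2, â, a letter) and consume 0x80 and 0xA2 (non-letters), which is consistent with the lone � in the second output.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: render each byte (code unit) of a string
// as two uppercase hex digits, separated by spaces.
std::string hex_bytes(std::string_view s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];    // high nibble
        out += digits[c & 0x0F];  // low nibble
    }
    return out;
}
```

Calling `hex_bytes("\xE2\x80\xA2")` on the bullet gives `E2 80 A2`, while an ASCII `-` is the single byte `2D`.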

Currently, when you pass two iterators, you can't trigger the special UTF-8 iterators.

This is a workaround: https://godbolt.org/z/Kz5arc1qE

I'm not sure how to do it nicely; your other options are in wrapper.hpp, lines 156-184.

Keeping this open in case I find a better solution.

Mar 15 '22 16:03 hanickadot
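To make "special UTF-8 iterators" concrete, here is a rough sketch of the idea: instead of yielding one byte per step, the iterator yields one decoded code point per step. This is not CTRE's implementation; `next_code_point` and `decode_all` are hypothetical names, the input is assumed to be well-formed UTF-8, and plain `char` is used instead of `char8_t` so the code compiles before C++20.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Decode the code point that starts at s[i] and advance i past it.
// Assumes well-formed UTF-8; a stray continuation byte is returned as-is.
char32_t next_code_point(std::string_view s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    // Number of continuation bytes implied by the lead byte.
    const int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : lead >= 0xC0 ? 1 : 0;
    // Keep only the payload bits of the lead byte (0x1F, 0x0F or 0x07).
    char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
    for (int k = 0; k < extra && i < s.size(); ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Decode a whole string into code points.
std::u32string decode_all(std::string_view s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();)
        out += next_code_point(s, i);
    return out;
}
```

With this kind of decoding, the three bytes E2 80 A2 come out as the single value U+2022, which is what a character class containing • needs to compare against.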

Thank you for the quick response! So, ctre::search decides whether to treat the input as utf8 or bytes based on the type of the argument (right now only if it's a single argument, i.e. std::u8string_view vs. std::string_view).

May I suggest making this decision based on the type of the template parameter string, or on a tag type passed as another template parameter, or by defining ctre::search and ctre::search_u8? I think this would (1) avoid the ambiguity of whether we are treating the string as unicode or not, (2) make it more convenient to work with utf8 strings represented as regular old char*, and (3) avoid the back and forth casting.

More on point (2): char8_t and u8string_view are very new, so most codebases aren't written to return these types as is. Plus, correct me if I'm wrong about this, but unless one is working in some specific domain, wouldn't one expect strings to be encoded in utf8 by default? The only thing I'm trying to suggest is that relying on the argument character type being char vs char8_t seems a bit more awkward than having ctre::search and ctre::search_u8.

Mar 15 '22 17:03 ivankp

It's actually based on the type of the argument's iterator. You can always take a std::string_view and mark it as a ctre::utf8_range. The name of the "function" just names the algorithm; the type of the arguments marks the code-unit/code-point semantics. Making a _u8 function would lead to making _u16 and _u32 functions, which is not something I want to do.

Mar 15 '22 17:03 hanickadot
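The design described in the last reply (one algorithm name, with the argument type selecting code-unit or code-point semantics) can be illustrated without CTRE. In the sketch below, `utf8_tagged` and `length` are hypothetical names invented for this example; they only mirror the idea of marking a view as UTF-8 and are not the library's API.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical tag wrapper: same bytes, but the type says "treat me as UTF-8".
struct utf8_tagged {
    std::string_view bytes;
};

// Code-unit semantics: every byte counts.
std::size_t length(std::string_view s) {
    return s.size();
}

// Code-point semantics: count every byte that is not a
// continuation byte (continuation bytes look like 10xxxxxx).
std::size_t length(utf8_tagged s) {
    std::size_t n = 0;
    for (unsigned char b : s.bytes)
        if ((b & 0xC0) != 0x80) ++n;
    return n;
}
```

For the five-byte string "• x", `length(sv)` reports 5 and `length(utf8_tagged{sv})` reports 3. The algorithm name stays the same and no `_u8`, `_u16` or `_u32` variants are needed, which is the trade-off argued for above.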
