Regex101 eregex (posix-extended) flavor

Open hammer opened this issue 9 years ago • 36 comments

As implemented by Boost, for example.

Nov 06 '14 12:11 hammer

I'm pretty sure all of that is already implemented. Is there something specific?

Nov 06 '14 16:11 firasdib

I see only "pcre (php)", "javascript", and "python" in the "flavor" section.

Nov 06 '14 18:11 hammer

Yes, but POSIX-Extended regex falls under the PCRE-category.

Nov 06 '14 18:11 firasdib

@firasdib: POSIX's matching behavior is different from PCRE/JavaScript/Python. POSIX searches for the leftmost-longest match, while most other flavors return the leftmost match the engine finds first, i.e. the first string that matches as the search tree is expanded.

A simple example is a|aaa on input string "aaa". POSIX will always match "aaa" without fail, and other flavors will match "a".

By the way, collation is one feature available in POSIX that most other flavors lack. The feature is powerful, and lets you search for all characters similar to a with very simple syntax.

Nov 17 '14 04:11 nhahtdh

I see. In that case I would need to compile a POSIX-regex engine down to JS. Is there a stand-alone C-library I can use, @nhahtdh ?

Dec 04 '14 10:12 firasdib

@firasdib: You may want to take a look at the Boost Regex library, but I don't know whether it implements any extensions to the base specification, since the specification at the Open Group leaves some behavior undefined.

Another thing is the locale-specific behavior of POSIX regex, which is currently not implemented in the interface.

Dec 05 '14 02:12 nhahtdh

I'm not sure what you're looking for, but I would start with the glibc implementation.

https://code.woboq.org/userspace/glibc/posix/regex_internal.c.html
https://code.woboq.org/userspace/glibc/posix/regexec.c.html

Sep 23 '17 17:09 cheako

I think regex support is included in emscripten (libc/musl/src/regex); you just need to write a wrapper and compile it.

Sep 26 '17 05:09 cheako

I've tested this out on a simple example, and emcc built and included the backend functions all on its own: https://github.com/cheako/glibc/blob/emscripten-hack/test.js#L12716 https://github.com/cheako/glibc/blob/emscripten-hack/test.js#L7033 Example usage: https://github.com/cheako/glibc/blob/emscripten-hack/test.js#L1943

Sep 29 '17 01:09 cheako

Now that there is a Javascript implementation, how long until we can see this being added?

Oct 21 '17 08:10 cheako

Is this still being developed?

Mar 29 '18 17:03 izxle

I figured out what I needed by writing many small regex programs, brute force. Though I would still like to see this implemented in case I need it in the future.

Mar 30 '18 07:03 cheako

@cheako If you could explain the process more thoroughly, I could look into this again.

Is this related to https://github.com/firasdib/Regex101/issues/431 ?

Jan 08 '19 15:01 firasdib

That's great. All I had to do was use https://www.gnu.org/software/libc/manual/html_node/Matching-POSIX-Regexps.html and emscripten includes support automatically. As for whether this is the same as what Boost uses, I wouldn't know.

Just keep in mind that there are a number of flags used when compiling a POSIX regex, for example to enable a regex extension.

Jan 08 '19 17:01 cheako

@cheako Thanks. I'll have to read up and see what the implications of implementing this flavor are.

Jan 08 '19 19:01 firasdib

Just ran into this when I tried my fav regex tester with a POSIX regex and it validated \d+ which is not valid in POSIX regex. It should be [0-9]+ or [[:digit:]]

So this tool needs a true POSIX flavour ;)

May 20 '20 09:05 pke

Getting that 7-year itch? :hourglass:

May 05 '21 01:05 Roy-Orbison

Hey, this could now be implemented simply with wasm, if it wasn't seen as simple before.

May 05 '21 01:05 cheako

I wasted my time putting a good comment in #1230 before I saw this issue. So I'll just copy it to here:

I can probably add some context here (in addition to an upvote for the feature request). From the grep man page:

grep understands three different versions of regular expression syntax: “basic” (BRE), “extended” (ERE) and “perl” (PCRE). In GNU grep there is no difference in available functionality between basic and extended syntaxes. In other implementations, basic regular expressions are less powerful. The following description applies to extended regular expressions; differences for basic regular expressions are summarized afterwards. Perl-compatible regular expressions give additional functionality, and are documented in pcresyntax(3) and pcrepattern(3), but work only if PCRE is available in the system.

The rest of that page goes on to document BRE vs. ERE and is a great reference for anybody looking. Long story short, the only difference between the two is escaping and handling of special characters (?, +, (), {}, |). Support is as follows:

awk: only supports ERE

sed: supports BRE (default) or ERE (with -E)

grep: supports BRE (default) or ERE (with -E). May support PCRE (with -P) but this is pretty uncommon; bourne shell doesn't support it

BRE (default) and ERE (activated with the -E flag) are the two implementations that are always available, and are also used for sed. -P for PCRE is out there but pretty uncommon, and not available with sed or awk anyway.

So, long story short, adding BRE and ERE support to the best string-related resource out there would be amazing!

The wasm idea really isn't bad. The usual rust regex parser is closer to perl than BRE/ERE and I don't see any other implementations, but that would honestly be kind of fun to code.

May 12 '22 13:05 tgross35

I just want to add my support for adding both POSIX BRE and ERE. Relevant excerpt from re_format(7) man page:

REGEX(7)                            Linux Programmer's Manual                            REGEX(7)

NAME
       regex - POSIX.2 regular expressions

DESCRIPTION
       Regular expressions ("RE"s), as defined in POSIX.2, come in two forms: modern REs (roughly
       those of egrep; POSIX.2 calls these "extended" REs) and obsolete  REs  (roughly  those  of
       ed(1); POSIX.2 "basic" REs).  Obsolete REs mostly exist for backward compatibility in some
       old programs; they will be discussed at the end.  POSIX.2 leaves some aspects of RE syntax
       and  semantics open; "(!)" marks decisions on these aspects that may not be fully portable
       to other POSIX.2 implementations.

May 17 '23 17:05 UrsineRaven

Yes, but POSIX-Extended regex falls under the PCRE-category.

no, it doesn't

[Image: screenshot] The second line should match (and in fact does with libregex), and the parenthesised group should just be t. In regex101 it doesn't match with any flavour.

Jul 27 '23 15:07 knollet

Did you include the g flag?

I've never encountered the {m}? pattern before, what is that expected to match exactly?

Jul 27 '23 18:07 firasdib

I haven't looked into this, but from a plain reading I see that the first line captures 51 chars and the second line captures only 1. For the pattern to make sense, though, shouldn't it end in $?

I remember from when I was trying to use this that I tested a good number of settings and couldn't find one that matched my experiments. If you feel a certain set of settings matches the behavior, please add an alias for those settings. That way, people won't have to go back and forth with you as the settings are further refined. I believe the power of this tool is the confidence that using it gives. If users are required to guess settings that say one thing but are meant to mean another, that removes this valuable feature.

Jul 27 '23 18:07 cheako

@pke

Just ran into this when I tried my fav regex tester with a POSIX regex and it validated \d+ which is not valid in POSIX regex. It should be [0-9]+ or [[:digit:]]

On the day you commented that, the current release of boost was v1.73.0 which supports \d. However, it's not universal so regex101 would probably need to allow such escape sequences, but put a warning in the Explanation panel.

@knollet You have an off-by-one error. Your test pattern matches 51 characters in the subgroup, not 50: https://regex101.com/r/A12aQJ/1

@firasdib The {m}? is valid syntax for making the previous atom ungreedy, but since {m} is a fixed length, the ? does nothing. It would have to be {m,n}? or have an extra grouping like (.{50})? to have any effect.

Still hoping ERE will be supported one day.

Jul 28 '23 01:07 Roy-Orbison

@pke

You have an off-by-one error. Your test pattern matches 51 characters in the subgroup, not 50: https://regex101.com/r/A12aQJ/1

https://jdoodle.com/ia/Ktj

I did test this. I got that the group (.{2}?.) matches either 3 characters or 1.

// Test how regexec() handles ".{2}?" in a POSIX extended regex.
#include <stdio.h>
#include <regex.h>

#define MAX_GROUPS 2

// Run the compiled regex against source and print every group that matched.
static void print_groups(const regex_t *regexCompiled, const char *source) {
  regmatch_t groupArray[MAX_GROUPS];

  if (regexec(regexCompiled, source, MAX_GROUPS, groupArray, 0) != 0)
    return; // No match

  for (unsigned int g = 0; g < MAX_GROUPS; g++) {
    if (groupArray[g].rm_so == -1)
      break; // No more groups

    // regoff_t is a signed type; cast to long so it matches %ld everywhere.
    printf("Group %u: [%2ld-%2ld]: %.*s\n",
      g, (long) groupArray[g].rm_so, (long) groupArray[g].rm_eo,
      (int) (groupArray[g].rm_eo - groupArray[g].rm_so),
      source + groupArray[g].rm_so);
  }
}

int main(void) {
  const char *regexString = "(.{2}?.).*$";
  regex_t regexCompiled;

  if (regcomp(&regexCompiled, regexString, REG_EXTENDED)) {
    printf("Could not compile regular expression.\n");
    return 1;
  }

  print_groups(&regexCompiled, "1234");
  print_groups(&regexCompiled, "12");

  regfree(&regexCompiled);
  return 0;
}

Group 0: [ 0- 4]: 1234
Group 1: [ 0- 3]: 123
Group 0: [ 0- 2]: 12
Group 1: [ 0- 1]: 1

Jul 28 '23 01:07 cheako

@cheako I meant in @knollet's PCRE example, the ? does not behave as a {0,1} shortcut. It's acting as an ungreedy modifier for an atom that cannot match fewer than 50 characters, so it shouldn't be expected that the () group will match 50 characters (because of the extra .). That makes the assertion "the second line should match" untrue.

Clearly the behaviour in ERE is different, as you demonstrated, which adds to the case for a separate ERE flavour.

Jul 28 '23 01:07 Roy-Orbison

... so it shouldn't be expected that the () group will match 50 characters (because of the extra .). That makes the assertion "the second line should match" untrue.

I didn't say I expected the first group to match 50. I know it matches 51 OR 1. I said the second line should match completely. The parenthesized group matches t (1 char, the last . in the group), and the .* (greedily) matches the rest. A $ for it to "make sense" is not really necessary; it may be for some use cases, which I am not talking about here.

@cheako I meant in @knollet's PCRE example, the ? does not behave as a {0,1} shortcut. It's acting as an ungreedy modifier

How would it be possible for ? to act as an ungreedy modifier on something which in itself has no options? x{50} matches exactly 50 x's. Not "one OR more" like +, which you could modify with "as few as possible" or "as many as possible".

But ok, the very similar x{1,50} on the other hand would make sense to modify. I would have to check whether it is. Still, there is a certain behaviour of posix-extended which is not represented in any of these flavours.

Jul 28 '23 09:07 knollet

@firasdib

Did you include the g flag?

Yes. But I don't see what difference that should make. Without g (meaning "global", which gives all the matches instead of only the first), it should still return the first match. But it returns none.

Jul 28 '23 09:07 knollet

@knollet I misunderstood you. You meant ‘If PCRE and ERE had equivalent matching, the second line would match in this PCRE example, but it doesn't.’ We can verify your example with grep as an ERE:

grep -E '(.{50}?.).*' <<<'the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567'
the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567

then PCRE:

grep -P '(.{50}?.).*' <<<'the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 123456
the quick brown fox jumps over the lazy dog 1234567'
the quick brown fox jumps over the lazy dog 1234567890123456789
the quick brown fox jumps over the lazy dog 1234567

Jul 31 '23 00:07 Roy-Orbison

I don't get how that test is accurate; you are not printing the capture group.

I'm unsure where grep (is that a GNU extension?) gets its PCRE implementation, so for those playing along at home, YMMV. If you don't get good results, you can try this emulator: https://www.tutorialspoint.com/linux_terminal_online.php

There is no reason to expect it's consistent from one grep install to another.

$ perl -e'print"$1:$2\n"if("1234" =~ /((.{2}?.).*)$/)'
1234:123
$ perl -e'print"$1:$2\n"if("12" =~ /((.{2}?.).*)$/)'
$

Jul 31 '23 03:07 cheako
