phobos icon indicating copy to clipboard operation
phobos copied to clipboard

regex - `(..).*\1` doesn't match "axxxx"

Open dlangBugzillaToGithub opened this issue 10 years ago • 12 comments

gassa (@GassaFM) reported this on 2015-12-31T18:21:20Z

Transfered from https://issues.dlang.org/show_bug.cgi?id=15489

CC List

  • ag0aep6g
  • alex.bleron
  • bugzilla (@WalterBright)
  • dlang-bugzilla (@CyberShadow)
  • dmitry.olsh (@DmitryOlshansky)
  • mipri
  • oddp
  • schuetzm

Description

I want to test for the pattern "two consecutive characters, arbitrary sequence, the same two consecutive characters".  My pattern is r"(..).*\1".  It works with bmatch but not with match/matchAll.  A complete example follows.

import std.regex, std.stdio;
void main ()
{
	writeln (bmatch   ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
	writeln (match    ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
	writeln (matchAll ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
	writeln (bmatch   ("xabab", r"(..).*\1"));  // [["abab", "ab"]]
	writeln (match    ("xabab", r"(..).*\1"));  // []
	writeln (matchAll ("xabab", r"(..).*\1"));  // []
}

As you can see, bmatch (usage discouraged in the docs) gives me the result I want, but match (also discouraged) and matchAll (way to go) don't.

Original thread in D.learn: http://forum.dlang.org/thread/[email protected].  No one replied so far, so I assume this is a bug.

dlangBugzillaToGithub avatar Dec 31 '15 18:12 dlangBugzillaToGithub

schuetzm commented on 2016-01-01T12:39:15Z

Digger blames this PR:

    dmd: Merge pull request #2768 from dawgfoto/fix11406
    
    https://github.com/D-Programming-Language/dmd/pull/2768
    
    fix Issue 11406 - ld.gold breaks switch table jumps

Does the regex implementation use a jump table? Looks plausible.

dlangBugzillaToGithub avatar Jan 01 '16 12:01 dlangBugzillaToGithub

schuetzm commented on 2016-01-01T12:42:09Z

bisect.ini:
bad  = master
good = 2.064
tester = cd /tmp && dmd xx.d && ! ./xx | grep '\[]'

@ag0aep6g: If digger is to be believed, dmd is actually the right component.

dlangBugzillaToGithub avatar Jan 01 '16 12:01 dlangBugzillaToGithub

gassa commented on 2016-01-01T16:02:32Z

(In reply to Marc Schütz from comment #1)
> Digger blames this PR:
> 
>     dmd: Merge pull request #2768 from dawgfoto/fix11406
>     
>     https://github.com/D-Programming-Language/dmd/pull/2768
>     
>     fix Issue 11406 - ld.gold breaks switch table jumps
> 
> Does the regex implementation use a jump table? Looks plausible.

Strange.  I tested with 2.069.2 but also with 2.063 and now with 2.063.2 (Jun 18, 2013 by this post: http://forum.dlang.org/post/[email protected]).  There is no matchAll there yet, so I had to comment two lines, but with 2.063 and 2.063.2, match and bmatch still differ in output.  And the pull is merged on Nov 17, 2013 (later).  So perhaps the bug disappeared and reappeared multiple times.

dlangBugzillaToGithub avatar Jan 01 '16 16:01 dlangBugzillaToGithub

dmitry.olsh (@DmitryOlshansky) commented on 2016-04-06T12:53:49Z

Ivan Kazmenko (In reply to Ivan Kazmenko from comment #0)
> I want to test for the pattern "two consecutive characters, arbitrary
> sequence, the same two consecutive characters".  My pattern is r"(..).*\1". 
> It works with bmatch but not with match/matchAll.  A complete example
> follows.
> 
> import std.regex, std.stdio;
> void main ()
> {
> 	writeln (bmatch   ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
> 	writeln (match    ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
> 	writeln (matchAll ("abab",  r"(..).*\1"));  // [["abab", "ab"]]
> 	writeln (bmatch   ("xabab", r"(..).*\1"));  // [["abab", "ab"]]
> 	writeln (match    ("xabab", r"(..).*\1"));  // []
> 	writeln (matchAll ("xabab", r"(..).*\1"));  // []
> }
> 
> As you can see, bmatch (usage discouraged in the docs) gives me the result I
> want, but match (also discouraged) and matchAll (way to go) don't.
> 

Well this boils down to :
a) Use what works for now.
b) You might have found a case where Thompson engine simply cannot produce the right result.

Finally I should detect these patterns and switch between simple backtracking and thompson engine.

dlangBugzillaToGithub avatar Apr 06 '16 12:04 dlangBugzillaToGithub

dmitry.olsh (@DmitryOlshansky) commented on 2017-08-25T10:10:55Z

*** Issue 17520 has been marked as a duplicate of this issue. ***

dlangBugzillaToGithub avatar Aug 25 '17 10:08 dlangBugzillaToGithub

dmitry.olsh (@DmitryOlshansky) commented on 2017-08-25T10:13:16Z

*** Issue 16251 has been marked as a duplicate of this issue. ***

dlangBugzillaToGithub avatar Aug 25 '17 10:08 dlangBugzillaToGithub

mipri commented on 2019-11-24T11:16:53Z

> a) Use what works for now.
Is this an option anymore? bmatch() no longer matches this result, and appears to have been deprecated in late 2017 as the source is now identical to that of match()

dlangBugzillaToGithub avatar Nov 24 '19 11:11 dlangBugzillaToGithub

bugzilla (@WalterBright) commented on 2019-12-01T15:58:02Z

Seems to work meanwhile...

dlangBugzillaToGithub avatar Dec 01 '19 15:12 dlangBugzillaToGithub

oddp commented on 2020-06-19T10:41:34Z

This backreference issue isn't fixed entirely:

void main () {
    import std.regex, std.stdio;
    auto r = r"(..).*\1";
    foreach (s; ["abab", "abcab", "xababx", "xabcabx"])
        writefln("%s %s %s %s", s, s.matchFirst(r), s.match(r), s.bmatch(r));
    
    // abab    ["abab",  "ab"] [["abab",  "ab"]] [["abab",  "ab"]]
    // abcab   ["abcab", "ab"] [["abcab", "ab"]] [["abcab", "ab"]]
    // xababx  ["abab",  "ab"] [["abab",  "ab"]] [["abab",  "ab"]]
    // xabcabx [] [] []
}

Here are a handful ids that std.regex fails to match:

vrnyrnbfcx
jrqrkgdrqm
meflbefata
svdyayyedy
cwdpoivoie

Another hundred on the playground: https://run.dlang.io/is/O8slu9

// 100 out of 100 failed.

dlangBugzillaToGithub avatar Jun 19 '20 10:06 dlangBugzillaToGithub

Here we are, 10 years later, and this bug is still present. I ran into it today. I reported it on the D forum, but I copy my example here too:

Here is a string:

const s = "xdwduffwgcptfwad";

The instructions says: "It contains a pair of any two letters that appears at least twice in the string without overlapping, like xyxy (xy) or aabcdefgaa (aa), but not like aaa (aa, but it overlaps)."

Here, "fw" appears twice. My D code:

auto m1 = matchFirst(s, regex(r"(..).*\1"));

returns an empty m1 object.

jabbalaci avatar Nov 23 '25 14:11 jabbalaci

And here is a reduced example:

import std.regex;
import std.stdio;

void main()
{
    auto re = regex(r"(.).*\1");
    writeln("xaya".matchFirst(re)); // []
}

pbackus avatar Nov 23 '25 14:11 pbackus

As a workaround, you can replace .* with {0,N} for some large value of N:

import std.regex;
import std.stdio;

void main()
{
    auto re = regex(r"(.).{0,999999}\1");
    writeln("xaya".matchFirst(re)); // ["aya", "a"]
}

As long as your input string is less that 999,999 characters long, this will give the same result as .*.

pbackus avatar Nov 23 '25 16:11 pbackus