regex - `(..).*\1` doesn't match "axxxx"
gassa (@GassaFM) reported this on 2015-12-31T18:21:20Z
Transfered from https://issues.dlang.org/show_bug.cgi?id=15489
CC List
- ag0aep6g
- alex.bleron
- bugzilla (@WalterBright)
- dlang-bugzilla (@CyberShadow)
- dmitry.olsh (@DmitryOlshansky)
- mipri
- oddp
- schuetzm
Description
I want to test for the pattern "two consecutive characters, arbitrary sequence, the same two consecutive characters". My pattern is r"(..).*\1". It works with bmatch but not with match/matchAll. A complete example follows.
import std.regex, std.stdio;
void main ()
{
writeln (bmatch ("abab", r"(..).*\1")); // [["abab", "ab"]]
writeln (match ("abab", r"(..).*\1")); // [["abab", "ab"]]
writeln (matchAll ("abab", r"(..).*\1")); // [["abab", "ab"]]
writeln (bmatch ("xabab", r"(..).*\1")); // [["abab", "ab"]]
writeln (match ("xabab", r"(..).*\1")); // []
writeln (matchAll ("xabab", r"(..).*\1")); // []
}
As you can see, bmatch (usage discouraged in the docs) gives me the result I want, but match (also discouraged) and matchAll (way to go) don't.
Original thread in D.learn: http://forum.dlang.org/thread/[email protected]. No one replied so far, so I assume this is a bug.
schuetzm commented on 2016-01-01T12:39:15Z
Digger blames this PR:
dmd: Merge pull request #2768 from dawgfoto/fix11406
https://github.com/D-Programming-Language/dmd/pull/2768
fix Issue 11406 - ld.gold breaks switch table jumps
Does the regex implementation use a jump table? Looks plausible.
schuetzm commented on 2016-01-01T12:42:09Z
bisect.ini:
bad = master
good = 2.064
tester = cd /tmp && dmd xx.d && ! ./xx | grep '\[]'
@ag0aep6g: If digger is to be believed, dmd is actually the right component.
gassa commented on 2016-01-01T16:02:32Z
(In reply to Marc Schütz from comment #1)
> Digger blames this PR:
>
> dmd: Merge pull request #2768 from dawgfoto/fix11406
>
> https://github.com/D-Programming-Language/dmd/pull/2768
>
> fix Issue 11406 - ld.gold breaks switch table jumps
>
> Does the regex implementation use a jump table? Looks plausible.
Strange. I tested with 2.069.2 but also with 2.063 and now with 2.063.2 (Jun 18, 2013 by this post: http://forum.dlang.org/post/[email protected]). There is no matchAll there yet, so I had to comment two lines, but with 2.063 and 2.063.2, match and bmatch still differ in output. And the pull is merged on Nov 17, 2013 (later). So perhaps the bug disappeared and reappeared multiple times.
dmitry.olsh (@DmitryOlshansky) commented on 2016-04-06T12:53:49Z
Ivan Kazmenko (In reply to Ivan Kazmenko from comment #0)
> I want to test for the pattern "two consecutive characters, arbitrary
> sequence, the same two consecutive characters". My pattern is r"(..).*\1".
> It works with bmatch but not with match/matchAll. A complete example
> follows.
>
> import std.regex, std.stdio;
> void main ()
> {
> writeln (bmatch ("abab", r"(..).*\1")); // [["abab", "ab"]]
> writeln (match ("abab", r"(..).*\1")); // [["abab", "ab"]]
> writeln (matchAll ("abab", r"(..).*\1")); // [["abab", "ab"]]
> writeln (bmatch ("xabab", r"(..).*\1")); // [["abab", "ab"]]
> writeln (match ("xabab", r"(..).*\1")); // []
> writeln (matchAll ("xabab", r"(..).*\1")); // []
> }
>
> As you can see, bmatch (usage discouraged in the docs) gives me the result I
> want, but match (also discouraged) and matchAll (way to go) don't.
>
Well this boils down to :
a) Use what works for now.
b) You might have found a case where Thompson engine simply cannot produce the right result.
Finally I should detect these patterns and switch between simple backtracking and thompson engine.
dmitry.olsh (@DmitryOlshansky) commented on 2017-08-25T10:10:55Z
*** Issue 17520 has been marked as a duplicate of this issue. ***
dmitry.olsh (@DmitryOlshansky) commented on 2017-08-25T10:13:16Z
*** Issue 16251 has been marked as a duplicate of this issue. ***
mipri commented on 2019-11-24T11:16:53Z
> a) Use what works for now.
Is this an option anymore? bmatch() no longer matches this result, and appears to have been deprecated in late 2017 as the source is now identical to that of match()
bugzilla (@WalterBright) commented on 2019-12-01T15:58:02Z
Seems to work meanwhile...
oddp commented on 2020-06-19T10:41:34Z
This backreference issue isn't fixed entirely:
void main () {
import std.regex, std.stdio;
auto r = r"(..).*\1";
foreach (s; ["abab", "abcab", "xababx", "xabcabx"])
writefln("%s %s %s %s", s, s.matchFirst(r), s.match(r), s.bmatch(r));
// abab ["abab", "ab"] [["abab", "ab"]] [["abab", "ab"]]
// abcab ["abcab", "ab"] [["abcab", "ab"]] [["abcab", "ab"]]
// xababx ["abab", "ab"] [["abab", "ab"]] [["abab", "ab"]]
// xabcabx [] [] []
}
Here are a handful ids that std.regex fails to match:
vrnyrnbfcx
jrqrkgdrqm
meflbefata
svdyayyedy
cwdpoivoie
Another hundred on the playground: https://run.dlang.io/is/O8slu9
// 100 out of 100 failed.
Here we are, 10 years later, and this bug is still present. I ran into it today. I reported it on the D forum, but I copy my example here too:
Here is a string:
const s = "xdwduffwgcptfwad";
The instructions says: "It contains a pair of any two letters that appears at least twice in the string without overlapping, like xyxy (xy) or aabcdefgaa (aa), but not like aaa (aa, but it overlaps)."
Here, "fw" appears twice. My D code:
auto m1 = matchFirst(s, regex(r"(..).*\1"));
returns an empty m1 object.
And here is a reduced example:
import std.regex;
import std.stdio;
void main()
{
auto re = regex(r"(.).*\1");
writeln("xaya".matchFirst(re)); // []
}
As a workaround, you can replace .* with {0,N} for some large value of N:
import std.regex;
import std.stdio;
void main()
{
auto re = regex(r"(.).{0,999999}\1");
writeln("xaya".matchFirst(re)); // ["aya", "a"]
}
As long as your input string is less that 999,999 characters long, this will give the same result as .*.