inconsistent matches when capturing group is present

Open BurntSushi opened this issue 5 years ago • 8 comments

This program

extern crate regex; // 1.0.5
use regex::Regex;

fn main() {
    let text = "foo\nbar\nbaz\n";

    // Without a capture group around `^`.
    let re = Regex::new(r#"(?m)^[ \n]*[a-z]+[ \n]*$"#).unwrap();
    for c in re.captures_iter(text) {
        // Print the (start, end) byte offsets of the overall match.
        println!("{:?}", (c.get(0).unwrap().start(), c.get(0).unwrap().end()));
    }

    println!("-----------------");

    // Same pattern, with `^` wrapped in a capture group.
    let re = Regex::new(r#"(?m)(^)[ \n]*[a-z]+[ \n]*$"#).unwrap();
    for c in re.captures_iter(text) {
        println!("{:?}", (c.get(0).unwrap().start(), c.get(0).unwrap().end()));
    }
}

outputs

(0, 3)
(3, 7)
(7, 12)
-----------------
(0, 3)
(4, 7)
(8, 12)

but the outputs of each loop should be identical. This is almost certainly a case where the DFA gets it right (the first case) but the Pike VM/backtracker gets it wrong (the second case). My bet is that the compiler is producing bad byte code for the capture group. Taking a look at the byte code:

[andrew@Cheetah regex-debug]$ regex-debug compile '(?m)^[ \n]*[a-z]+[ \n]*$'
0000 Save(0) (start)
0001 StartLine
0002 Split(3, 4)
0003 '\n'-'\n', ' '-' ' (goto: 2)
0004 'a'-'z'
0005 Split(4, 6)
0006 Split(7, 8)
0007 '\n'-'\n', ' '-' ' (goto: 6)
0008 EndLine
0009 Save(1)
0010 Match(0)


[andrew@Cheetah regex-debug]$ regex-debug compile '(?m)(^)[ \n]*[a-z]+[ \n]*$'
0000 Save(0) (start)
0001 Save(2)
0002 StartLine
0003 Save(3)
0004 Split(5, 6)
0005 '\n'-'\n', ' '-' ' (goto: 4)
0006 'a'-'z'
0007 Split(6, 8)
0008 Split(9, 10)
0009 '\n'-'\n', ' '-' ' (goto: 8)
0010 EndLine
0011 Save(1)
0012 Match(0)

If there's an error here, I don't think I see it.
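One way to sanity-check the two listings by hand is to run them through a tiny backtracking interpreter. The sketch below is not the regex crate's implementation: `Inst`, `run`, `find_all`, `without_group` and `with_group` are hypothetical names, and it assumes that `StartLine` matches at offset 0 or right after `\n`, that `EndLine` matches at the end of the text or right before `\n`, and that a range instruction without an explicit goto falls through to the next instruction. Under those assumptions both listings yield the same offsets for the text above, which may suggest the byte code itself is equivalent.

```rust
// Hypothetical mini-interpreter; NOT the regex crate's implementation.
#[derive(Clone, Copy)]
enum Inst {
    Save(usize),
    StartLine,
    EndLine,
    Split(usize, usize),
    // Consume one byte inside any inclusive range, then jump to the target.
    Ranges(&'static [(u8, u8)], usize),
    Match,
}

const WS: &[(u8, u8)] = &[(b'\n', b'\n'), (b' ', b' ')];
const AZ: &[(u8, u8)] = &[(b'a', b'z')];

// Transcription of the listing for (?m)^[ \n]*[a-z]+[ \n]*$
fn without_group() -> Vec<Inst> {
    use Inst::*;
    vec![
        Save(0), StartLine, Split(3, 4), Ranges(WS, 2), Ranges(AZ, 5),
        Split(4, 6), Split(7, 8), Ranges(WS, 6), EndLine, Save(1), Match,
    ]
}

// Transcription of the listing for (?m)(^)[ \n]*[a-z]+[ \n]*$
fn with_group() -> Vec<Inst> {
    use Inst::*;
    vec![
        Save(0), Save(2), StartLine, Save(3), Split(5, 6), Ranges(WS, 4),
        Ranges(AZ, 7), Split(6, 8), Split(9, 10), Ranges(WS, 8), EndLine,
        Save(1), Match,
    ]
}

// Plain recursive backtracking: try the first Split branch, then the second.
fn run(prog: &[Inst], text: &[u8], pc: usize, at: usize, slots: &mut [Option<usize>]) -> bool {
    match prog[pc] {
        Inst::Match => true,
        Inst::Save(n) => {
            let old = slots[n];
            slots[n] = Some(at);
            if run(prog, text, pc + 1, at, slots) {
                return true;
            }
            slots[n] = old; // undo the capture on failure
            false
        }
        Inst::StartLine => {
            (at == 0 || text[at - 1] == b'\n') && run(prog, text, pc + 1, at, slots)
        }
        Inst::EndLine => {
            (at == text.len() || text[at] == b'\n') && run(prog, text, pc + 1, at, slots)
        }
        Inst::Split(a, b) => run(prog, text, a, at, slots) || run(prog, text, b, at, slots),
        Inst::Ranges(ranges, next) => {
            at < text.len()
                && ranges.iter().any(|&(lo, hi)| lo <= text[at] && text[at] <= hi)
                && run(prog, text, next, at + 1, slots)
        }
    }
}

// Collect non-overlapping matches, scanning start offsets left to right.
fn find_all(prog: &[Inst], text: &str) -> Vec<(usize, usize)> {
    let bytes = text.as_bytes();
    let mut out = Vec::new();
    let mut start = 0;
    while start <= bytes.len() {
        let mut slots: [Option<usize>; 4] = [None; 4];
        if run(prog, bytes, 0, start, &mut slots) {
            let (s, e) = (slots[0].unwrap(), slots[1].unwrap());
            out.push((s, e));
            // Resume after the match; step one byte past an empty match.
            start = if e > s { e } else { e + 1 };
        } else {
            start += 1;
        }
    }
    out
}

fn main() {
    let text = "foo\nbar\nbaz\n";
    println!("{:?}", find_all(&without_group(), text));
    println!("{:?}", find_all(&with_group(), text));
}
```

Both calls print `[(0, 3), (4, 7), (8, 12)]`, the offsets reported by the second loop. The sketch does no memoization, so it is only meant for small inputs like this one.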

cc @retep998

BurntSushi avatar Sep 18 '18 20:09 BurntSushi

Here is a somewhat more minimal example I have independently identified.

| Input | /(a)\d*\.?\d+\b/ | /a\d*\.?\d+\b/ |
|-------|------------------|----------------|
| a0.0c | Matches substring a0.0 | Matches substring a0 |

The presence of the capture group seems to affect the definition of word boundary somehow?
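To make the word-boundary question concrete, here is a minimal sketch of an ASCII-only `\b` check applied to `a0.0c`. The helpers `is_word_byte` and `is_word_boundary` are hypothetical, and the regex crate's `\b` is Unicode-aware by default, though that should make no difference for this ASCII input. Under that definition there is a boundary after `a0` (offset 2) but none after `a0.0` (offset 4), since both `0` and `c` are word characters.

```rust
// Hypothetical helpers for illustration; ASCII-only definition of \w.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

// A word boundary sits where exactly one side is a word character.
fn is_word_boundary(text: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(text[at - 1]);
    let after = at < text.len() && is_word_byte(text[at]);
    before != after
}

fn main() {
    let text = b"a0.0c";
    for at in 0..=text.len() {
        println!("offset {}: boundary = {}", at, is_word_boundary(text, at));
    }
}
```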

davisjam avatar Feb 16 '19 01:02 davisjam

I'm also running into this issue. When I place a capture group around a single-character match, sometimes it captures it, and sometimes it doesn't. The regex I'm using is ((?) with some other capture groups, and sometimes the parenthesis is captured and sometimes it's not. I can't seem to find a workaround. I may have to find another regex library. It's sad that this bug has been open for two years when it is easily reproducible.

paul-dev8 avatar Nov 27 '20 03:11 paul-dev8

Lol, my regex got changed after I submitted it. Anyway, it's an escaped parenthesis in a capture group with a ? quantifier.

It works on small texts, but larger texts cause it to be inconsistent.

paul-dev8 avatar Nov 27 '20 03:11 paul-dev8

@paul-dev8 Why not provide a program that others can reproduce instead of complaining about how sad it is that volunteers haven't fixed it yet? Then maybe someone could help with a workaround if possible.

BurntSushi avatar Nov 27 '20 03:11 BurntSushi

There's a sample program on top. Someone actually went to the effort of isolating the problem for you to obviously make it easy to fix. But it's languished for two long years. This library obviously has had a ton of work put into it, but if I find major bugs the first time I try to use it, those should be fixed first before working on "features". A library fundamentally needs to be correct before anything else.

paul-dev8 avatar Nov 29 '20 16:11 paul-dev8

OK, have it your way, I've blocked you from my account. I have almost no patience for people who try to lecture other volunteers on how they spend their free time, doubly so for when they are wrong. 1) the problem isn't necessarily easy to fix given that the root cause hasn't been identified yet and 2) I've spent the last couple years working on correctness problems instead of "features" (whatever that means).

BurntSushi avatar Nov 29 '20 17:11 BurntSushi

Hello, @BurntSushi, can I ask if there's any progress on this? Thank you! <3

qbz avatar May 17 '22 09:05 qbz

This will be fixed once #656 lands (along with many other bugs).

I have no specific timeline.

BurntSushi avatar May 17 '22 12:05 BurntSushi