
Various inconsistencies between different engines

Open SeanRBurton opened this issue 7 years ago • 3 comments

I've generated several failing test cases that exercise various dark corners and should make for good regression tests. I did my best to keep them mostly orthogonal, but most of the issues seem to be word-boundary related.

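// `ExecBuilder` lives in the crate's hidden `internal` module, which exists
// for testing individual matching engines.
use regex::internal::ExecBuilder;

fn test() {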
    let patterns = [
        "(?:(?-u:\\b)|(?u:h))+",
        "(?u:\\B)",
        "(?:(?u:\\b)|(?s-u:.))+",
        "(?:(?-u:\\B)|(?su:.))+",
        "(?m:$)(?m:^)(?su:.)",
        "(?m:$)^(?m:^)",
        "(?P<kp>(?iu:do)(?m:$))*",

        "(?u:\\B)",
        "(?:(?-u:\\b)|(?u:[\u{0}-W]))+",
        "((?m:$)(?-u:\\B)(?s-u:.)(?-u:\\B)$)",
        "(?m:$)(?m:$)^(?su:.)",
        "(?-u:\\B)(?m:^)",
        "(?:(?u:\\b)|(?-u:.))+",
    ];
    let haystacks = [
        "h",
        "鋸",
        "oB",
        "\u{fef80}",
        "\n‣",
        "\n",
        "dodo",

        "䡁",
        "0",
        "\n\n",
        "\n\u{81}¨\u{200a}",
        "0\n",
        "0",
    ];
    for (i, (pattern, haystack)) in patterns.iter()
                                            .zip(haystacks.iter()).enumerate() {
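        // re0: let the crate choose the matching engine automatically.
        let re0 = ExecBuilder::new(&pattern).only_utf8(false)
                                            .build()
                                            .unwrap()
                                            .into_regex();
        // re1: force the NFA engine; the first seven cases use a
        // byte-based program, the remaining ones a Unicode-based program.
        let re1 = ExecBuilder::new(&pattern).only_utf8(false)
                                            .nfa()
                                            .bytes(i < 7)
                                            .build()
                                            .unwrap()
                                            .into_regex();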
        let caps0 = re0.captures(haystack);
        let caps1 = re1.captures(haystack);
        let mut correct = true;
        match (caps0, caps1) {
            (Some(a), Some(b)) => {
                for (c0, c1) in a.iter().zip(b.iter()) {
                    match (c0, c1) {
                        (Some(c), Some(d)) => {
                            if c.start() != d.start() || c.end() != d.end() {
                                correct = false;
                                break;
                            }
                        }
                        (None, None) => (),
                        _ => {
                            correct = false;
                            break;
                        }
                    }
                }
            }
            _ => correct = false,
        }
        println!("{:?}", correct);
    }
}
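
As a rough illustration of why word boundaries are a likely source of disagreement, here is a minimal standalone sketch (it does not use the regex crate) that lists the offsets where an ASCII `\b` and a Unicode `\b` would hold for a few of the haystacks above. The helper names are hypothetical, and `char::is_alphanumeric` plus `_` is only an approximation of the Unicode word-character class, so treat this as an assumption-laden model rather than the crate's actual logic.

```rust
/// ASCII word byte, as assumed for `(?-u:\b)`: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Approximation (an assumption) of a Unicode word character.
fn is_unicode_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Byte offsets at which an ASCII word boundary would hold.
/// Every byte offset is considered, including offsets inside a
/// multi-byte UTF-8 sequence.
fn ascii_word_boundaries(haystack: &str) -> Vec<usize> {
    let bytes = haystack.as_bytes();
    (0..=bytes.len())
        .filter(|&i| {
            let before = i > 0 && is_ascii_word_byte(bytes[i - 1]);
            let after = i < bytes.len() && is_ascii_word_byte(bytes[i]);
            before != after
        })
        .collect()
}

/// Byte offsets at which a Unicode word boundary would hold.
/// Only offsets that fall on a char boundary are considered.
fn unicode_word_boundaries(haystack: &str) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&i| haystack.is_char_boundary(i))
        .filter(|&i| {
            let before = haystack[..i]
                .chars()
                .next_back()
                .map_or(false, is_unicode_word_char);
            let after = haystack[i..]
                .chars()
                .next()
                .map_or(false, is_unicode_word_char);
            before != after
        })
        .collect()
}

fn main() {
    for haystack in ["h", "鋸", "oB"] {
        println!(
            "{:?}: ascii {:?}, unicode {:?}",
            haystack,
            ascii_word_boundaries(haystack),
            unicode_word_boundaries(haystack)
        );
    }
}
```

Under this model, "鋸" has no ASCII word boundaries at all, so an ASCII `\B` holds at every byte offset, including the two offsets inside the three-byte UTF-8 sequence. A Unicode `\b` holds at offsets 0 and 3. An engine stepping over bytes and one stepping over code points therefore have to agree on how to treat those interior offsets.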

SeanRBurton avatar Dec 11 '17 15:12 SeanRBurton

@SeanRBurton: good work! What follow-up to this do you have in mind?

sanmai-NL avatar Jan 31 '18 16:01 sanmai-NL

@sanmai-NL Eh? If there are inconsistencies between the different engines, then we should fix them. These are hard bugs to fix because they require deep context, and likely don't appear often, so they haven't hit the top of my list to look at.

BurntSushi avatar Jan 31 '18 16:01 BurntSushi

Yes, so I just meant to ask the reporter to suggest how to prioritize any potential improvements. He seems to take an interest in this issue considering he’s filing it. 🙂

sanmai-NL avatar Feb 11 '18 17:02 sanmai-NL

These are all fixed in my in-progress regex-automata work that will land once #656 is done.

BurntSushi avatar Mar 06 '23 19:03 BurntSushi