regex Various inconsistencies between different engines

I've generated several failing test cases that exercise various dark corners and should make good regression tests. I tried to keep them mostly orthogonal, but most of the issues seem to be related to word boundaries.

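// `ExecBuilder` is an unstable, doc-hidden API of the regex crate.
use regex::internal::ExecBuilder;

fn test() {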
    let patterns = [
        "(?:(?-u:\\b)|(?u:h))+",
        "(?u:\\B)",
        "(?:(?u:\\b)|(?s-u:.))+",
        "(?:(?-u:\\B)|(?su:.))+",
        "(?m:$)(?m:^)(?su:.)",
        "(?m:$)^(?m:^)",
        "(?P<kp>(?iu:do)(?m:$))*",

        "(?u:\\B)",
        "(?:(?-u:\\b)|(?u:[\u{0}-W]))+",
        "((?m:$)(?-u:\\B)(?s-u:.)(?-u:\\B)$)",
        "(?m:$)(?m:$)^(?su:.)",
        "(?-u:\\B)(?m:^)",
        "(?:(?u:\\b)|(?-u:.))+",
    ];
    let haystacks = [
        "h",
        "鋸",
        "oB",
        "\u{fef80}",
        "\n‣",
        "\n",
        "dodo",

        "䡁",
        "0",
        "\n\n",
        "\n\u{81}¨\u{200a}",
        "0\n",
        "0",
    ];
    for (i, (pattern, haystack)) in patterns.iter()
                                            .zip(haystacks.iter()).enumerate() {
        // re0: let the crate choose its matching engine.
        let re0 = ExecBuilder::new(&pattern).only_utf8(false)
                                            .build()
                                            .unwrap()
                                            .into_regex();
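        // re1: force an NFA engine. Cases 0-6 compile a byte-based program;
        // the remaining cases match on Unicode scalar values.
        let re1 = ExecBuilder::new(&pattern).only_utf8(false)
                                            .nfa()
                                            .bytes(i < 7)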
                                            .build()
                                            .unwrap()
                                            .into_regex();
        let caps0 = re0.captures(haystack);
        let caps1 = re1.captures(haystack);
        let mut correct = true;
        match (caps0, caps1) {
            (Some(a), Some(b)) => {
                for (c0, c1) in a.iter().zip(b.iter()) {
                    match (c0, c1) {
                        (Some(c), Some(d)) => {
                            if c.start() != d.start() || c.end() != d.end() {
                                correct = false;
                                break;
                            }
                        }
                        (None, None) => (),
                        _ => {
                            correct = false;
                            break;
                        }
                    }
                }
            }
            _ => correct = false,
        }
        // Print the case index so a failing pattern/haystack pair can be identified.
        println!("{}: {:?}", i, correct);
    }
}
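
The comparison at the core of the test above can also be sketched on its own, without the regex crate, by working on plain byte spans. This is only an illustration: `Span` and `captures_agree` are hypothetical names, not part of any crate. It also differs from the loop above in two ways that are assumptions rather than the reporter's intent: it treats two "no match" results as agreement, and it requires both engines to report the same number of groups.

```rust
/// One capture group's byte span, or `None` if the group did not participate.
/// (Hypothetical type, for illustration only.)
type Span = Option<(usize, usize)>;

/// Returns true when two engines report the same result for one search.
/// `None` means "no match"; `Some(groups)` lists every group's span.
fn captures_agree(a: Option<&[Span]>, b: Option<&[Span]>) -> bool {
    match (a, b) {
        // Assumption: two engines that both report "no match" agree.
        (None, None) => true,
        // Slice equality compares the group counts as well as every span.
        (Some(x), Some(y)) => x == y,
        // One engine matched and the other did not.
        _ => false,
    }
}

fn main() {
    // Made-up spans standing in for the output of two engines.
    let first: Vec<Span> = vec![Some((0, 1)), None];
    let second: Vec<Span> = vec![Some((0, 2)), None];
    println!("{}", captures_agree(Some(first.as_slice()), Some(first.as_slice())));
    println!("{}", captures_agree(Some(first.as_slice()), Some(second.as_slice())));
}
```

With the real crate, the spans would come from each `Captures` value, mapping every group to its `(start, end)` pair.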

Dec 11 '17 15:12 SeanRBurton

@SeanRBurton: good work! What follow-up to this do you have in mind?

Jan 31 '18 16:01 sanmai-NL

@sanmai-NL Eh? If there are inconsistencies between the different engines, then we should fix them. These are hard bugs to fix because they require deep context, and likely don't appear often, so they haven't hit the top of my list to look at.

Jan 31 '18 16:01 BurntSushi

Yes, I just meant to ask the reporter to suggest how to prioritize any potential improvements. He seems to take an interest in this issue, considering he filed it. 🙂

Feb 11 '18 17:02 sanmai-NL

These are all fixed in my in-progress regex-automata work that will land once #656 is done.

Mar 06 '23 19:03 BurntSushi
