Test Regex Scraped from crates.io

Open ethanpailes opened this issue 7 years ago • 5 comments

As part of evaluating my master's thesis, I scraped crates.io for regexes and ran them through my compiler to see how many could be optimized. I don't think it would be too hard to clean up my scraping script a bit and then write a test that executes each regex on a quickcheck-generated string with all (3? 5? 6? depends how you count them) backends. I'm not sure when I'll do this, but I wanted to leave a note here so that I don't forget.

ethanpailes, Apr 23 '18 15:04

:hearts:

That would be interesting! One thing worth pointing out is that a quickcheck-generated string is very unlikely to produce a match. Instead, it would only test non-match agreement, which still seems like a worthwhile thing!

BurntSushi, Apr 23 '18 15:04

Hmm. Good point. One thing I remember reading in a note Russ Cox made about the testing approach for RE2 is that he wrote some code to construct a random matching string from a regex, so it might be worthwhile to give that a crack. The two sources of random input would probably do a pretty good job of testing both the positive and negative cases.
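
To make the idea of building a matching string from a regex concrete, here is a minimal sketch. It is not the RE2 code and not the API of any existing crate: the `Node` AST, the `XorShift` generator, and `generate` are all hypothetical names invented for illustration, and the AST covers only a handful of constructs. It assumes every `Class` and `Alternate` is non-empty and that `min <= max` in every `Repeat`.

```rust
/// A toy regex AST (hypothetical); real regex syntax has many more node kinds.
#[derive(Debug, Clone)]
enum Node {
    Literal(char),
    /// One character out of a non-empty set.
    Class(Vec<char>),
    Concat(Vec<Node>),
    /// One branch out of a non-empty list.
    Alternate(Vec<Node>),
    /// Repeat `node` between `min` and `max` times, inclusive (min <= max).
    Repeat { node: Box<Node>, min: u32, max: u32 },
}

/// Small deterministic xorshift PRNG so the sketch needs no external crates.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// A value in `0..n`; `n` must be non-zero.
    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

/// Append one randomly chosen string matched by `node` to `out`.
fn generate(node: &Node, rng: &mut XorShift, out: &mut String) {
    match node {
        Node::Literal(c) => out.push(*c),
        Node::Class(chars) => {
            let i = rng.below(chars.len() as u64) as usize;
            out.push(chars[i]);
        }
        Node::Concat(nodes) => {
            for n in nodes {
                generate(n, rng, out);
            }
        }
        Node::Alternate(nodes) => {
            let i = rng.below(nodes.len() as u64) as usize;
            generate(&nodes[i], rng, out);
        }
        Node::Repeat { node, min, max } => {
            // Pick a repetition count uniformly from min..=max.
            let count = min + rng.below(u64::from(max - min) + 1) as u32;
            for _ in 0..count {
                generate(node, rng, out);
            }
        }
    }
}
```

Feeding strings produced this way to every backend would exercise the positive case, while plain random input keeps covering the negative one. Unbounded repetitions such as `*` would need an arbitrary cap on `max` before they fit this AST.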

ethanpailes, Apr 23 '18 15:04

Yup, that's another good avenue to try!

BurntSushi, Apr 23 '18 15:04

After a first pass at this just using quickcheck to generate random input, I've turned up the following failing test cases.

extern crate regex;

#[test]
fn word_boundary_backtracking_default_mismatch() {
    use regex::internal::ExecBuilder;

    let backtrack_re = ExecBuilder::new(r"\b")
        .bounded_backtracking()
        .build()
        .map(|exec| exec.into_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let default_re = ExecBuilder::new(r"\b")
        .build()
        .map(|exec| exec.into_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let input = "䅅\\u{a0}";

    let fi1 = backtrack_re.find_iter(input);
    let fi2 = default_re.find_iter(input);
    for (m1, m2) in fi1.zip(fi2) {
        assert_eq!(m1, m2);
    }
}

#[test]
fn uppercut_s_backtracking_bytes_default_bytes_mismatch() {
    use regex::internal::ExecBuilder;

    let backtrack_bytes_re = ExecBuilder::new("^S")
        .bounded_backtracking()
        .only_utf8(false)
        .build()
        .map(|exec| exec.into_byte_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let default_bytes_re = ExecBuilder::new("^S")
        .only_utf8(false)
        .build()
        .map(|exec| exec.into_byte_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let input = vec![83, 83];

    let s1 = backtrack_bytes_re.split(&input);
    let s2 = default_bytes_re.split(&input);
    for (chunk1, chunk2) in s1.zip(s2) {
        assert_eq!(chunk1, chunk2);
    }
}

#[test]
fn unicode_lit_star_backtracking_utf8bytes_default_utf8bytes_mismatch() {
    use regex::internal::ExecBuilder;

    let backtrack_bytes_re = ExecBuilder::new(r"^(?u:\*)")
        .bounded_backtracking()
        .bytes(true)
        .build()
        .map(|exec| exec.into_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let default_bytes_re = ExecBuilder::new(r"^(?u:\*)")
        .bytes(true)
        .build()
        .map(|exec| exec.into_regex())
        .map_err(|err| format!("{}", err))
        .unwrap();

    let input = "**";

    let s1 = backtrack_bytes_re.split(input);
    let s2 = default_bytes_re.split(input);
    for (chunk1, chunk2) in s1.zip(s2) {
        assert_eq!(chunk1, chunk2);
    }
}

The last two look like they are probably dups.
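
One caveat about the loops in these tests: `Iterator::zip` stops as soon as the shorter iterator ends, so they compare only the common prefix and would not notice a backend that yields extra or missing items. A stricter comparison could look like the sketch below, where `first_divergence` is a hypothetical helper, not something from the regex crate.

```rust
/// Returns the index of the first position where the two sequences disagree,
/// including the case where one ends before the other.
/// `None` means they yielded exactly the same items.
fn first_divergence<T, A, B>(a: A, b: B) -> Option<usize>
where
    T: PartialEq,
    A: IntoIterator<Item = T>,
    B: IntoIterator<Item = T>,
{
    let mut a = a.into_iter();
    let mut b = b.into_iter();
    let mut index = 0;
    loop {
        match (a.next(), b.next()) {
            // Both sides ended together without a mismatch.
            (None, None) => return None,
            (Some(x), Some(y)) if x == y => index += 1,
            // Either the items differ or exactly one side is exhausted.
            _ => return Some(index),
        }
    }
}
```

A test would then assert that `first_divergence(fi1, fi2)` is `None` instead of looping over the zipped pairs. This only requires the item type to implement `PartialEq`, which the `assert_eq!` calls above already rely on.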

My work on this currently lives here, and I think it is basically ready for a PR except for a few config issues.

  1. Right now to run the checks I just added a new test binary and marked the entrypoint with #[test]. This is bad for a few different reasons.
    • The tests take a long time to run, and I'm not sure they should be in regular CI. If it is possible to have two different test profiles (one for regular CI and a more complete one for releases and major new features), that would be great.
    • There is a lot of work going on for one #[test], and we really should see output on the screen to monitor progress (--nocapture works for this when you are focusing on the test, but it would not fit into a bigger run of the test suite). Ideally, cargo test would just invoke a binary without capturing its output and then report that the suite failed if the exit code is non-zero. I think I once saw a way to do this in a Stack Overflow post, but my google-fu has failed me.
  2. The specific cases that I just pulled out should definitely be part of the regular test suite, but they are not dependent on the definition of the regex! or regex_set! test macro, so they should not be run for every different test config. I'm not sure where the best place to put them for that is. I've just stashed them in the test_crates_regex test that I made for now, but if we turn that off for CI, it is the wrong place to put them.
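
On the second bullet under item 1: Cargo lets a test target opt out of the built-in harness by setting `harness = false`. `cargo test` then builds the file as a plain binary and runs its own `main`, so output goes straight to the terminal and a non-zero exit status is reported as a failure. The sketch below shows the shape this could take. The target name `crates_regex`, its path, and the helpers `run_all` and `exit_code` are placeholders for illustration, not what the branch actually contains.

```rust
// Cargo.toml (target name and path are placeholders):
//
//   [[test]]
//   name = "crates_regex"
//   path = "tests/crates_regex.rs"
//   harness = false
//
// With the harness disabled, the file's own `fn main` is the entry point:
//
//   fn main() {
//       let regexes = ["^S", r"\b"];
//       let failures = run_all(&regexes, |re| !re.is_empty());
//       std::process::exit(exit_code(failures));
//   }

/// Runs `check` on every regex, printing one progress line per regex,
/// and returns the number of failures.
fn run_all<F>(regexes: &[&str], mut check: F) -> usize
where
    F: FnMut(&str) -> bool,
{
    let mut failures = 0;
    for (i, &re) in regexes.iter().enumerate() {
        let ok = check(re);
        let status = if ok { "ok" } else { "FAILED" };
        println!("[{}/{}] {:?} ... {}", i + 1, regexes.len(), re, status);
        if !ok {
            failures += 1;
        }
    }
    failures
}

/// Maps a failure count to a process exit status: 0 on success, 1 otherwise.
fn exit_code(failures: usize) -> i32 {
    if failures == 0 {
        0
    } else {
        1
    }
}
```

In a real run, the closure passed to `run_all` would be the per-regex backend comparison rather than the trivial check shown in the comment.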

ethanpailes, Apr 27 '18 18:04

It may be worth looking at https://crates.io/crates/regex_generate when I get around to generating matching strings from a regex.

ethanpailes, May 02 '18 16:05

I'm going to say that this is closed by the work done a while back in #472. If there's something else we should do, please feel free to file a new issue!

BurntSushi, Mar 06 '23 18:03