regex Performance: Optimise HIR case folding

In the HIR translation stage, if in Unicode and ignore-case mode, case folding interval sets can take exceedingly long. This PR optimises case folding through the following techniques:

Determining if case folding is necessary.
- If we have already case folded, then we do not need to do so again.
- If we have already case folded and are performing a set operation with another interval set that has already been case folded, we do not need to do so again. (Short-hand evidence: if two interval sets are already case folded, then we can consider their set operation to not be operating on individual characters, but instead sets of case folded characters, thus the result of any set operation is necessarily also a set of sets of case folded characters)
- If we perform a set operation and the result is an interval set which we already know is case folded, then we do not need to do so again.
Optimising case folding itself.
- Most case folding operations occur over ranges of characters which are in order and adjacent. By keeping track of the previous location of a character, we can remove the need for additional binary searches for the next character.
- Removes unnecessary iterations when we already know how far away the next foldable character is by changing the range based on the error response of the case folding operation.
Hashing interval sets so that we can determine if an interval set has changed after an operation (necessary for technique 1).

These changes dramatically improve performance in these conditions. To demonstrate this, consider the following regex and test program:

(?i:[[:^space:]------------------------------------------------------------------------])

fn main() -> Result<(), Box<dyn std::error::Error>> {
    for pattern in std::env::args().skip(1) {
        const ITERS: u32 = 10;
        for _ in 0..ITERS {
            let _ = regex_syntax::Parser::new().parse(&pattern);
        }
    }
    Ok(())
}

Placing the test program in examples/test-pattern.rs, we can observe the runtime of the program before the changes:

[addisoncrump@addisoncrump-main regex]$ time cargo run --release --example test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/examples/test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'`

real	0m8,983s
user	0m8,934s
sys	0m0,014s

And after the changes:

[addisoncrump@addisoncrump-main regex]$ time cargo run --release --example test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/examples/test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'`

real	0m0,088s
user	0m0,069s
sys	0m0,018s

Jul 21 '22 18:07 addisoncrump

Nice find and nice work! So I think what I'd like to do is figure out, to what extent, each of these optimizations actually helps things. The hashing and the case table lookups in particular represent a fairly big boost in complexity that I would really like to make sure is worth it. Additionally, the hashing trick in particular poses a somewhat more practical issue for the medium term where I plan to add a alloc-only mode to regex-syntax, and the hashing routines aren't available in alloc. So for the alloc-only mode, we'd either need to figure out an alternative strategy, not do it at all, implement our own hashing or bring in another crate. None are particularly appetizing to me...

So with that said, I think the thing I'd like to know here is how much (1) helps all on its own. You mention that (3) is required for (1), but I'm not actually convinced of that? I think it's probably required if you want to optimize out every last bit of redundant work, but my suspicion is that we might be able to get away without hashing. That is, if you're performing a set operation on any two classes, your patch here already leverages the fact that if they're already both case folded than the result must be case folded too. It only falls back to hashing in the case where one wasn't already case folded. I wonder, how often does that happen in practice? And if it does happen often, perhaps we can change the caller patterns such that it happens less.

Another thing I'd like to do in this space is to introduce a separate character class nesting limit, but I think that's out of scope for this PR here.

Jul 21 '22 18:07 BurntSushi

Hm, okay -- I'll try feature-gating various sections, then re-using the fuzzer I used to find this particular input to find troublemaking inputs with various feature flags.

Jul 21 '22 18:07 addisoncrump

I ended up investigating this a bit, and I was able to fix the perf problem with your regex by just using technique (1) (with a tweak or two). So I've stuck with just that for now.

I'm also going to look into a nest limit for character classes.

Mar 06 '23 02:03 BurntSushi

I ended up thinking more about your case folding optimizations as well and decided to take a crack at it. But I did not want to make the type signature more unruly than it already was. So I ended up just creating a stateful type that knows how to do the right thing, and in the process, greatly simplified the caller. So the caller now looks like this:

    fn case_fold_simple(
        &self,
        ranges: &mut Vec<ClassUnicodeRange>,
    ) -> Result<(), unicode::CaseFoldError> {
        let mut folder = unicode::SimpleCaseFolder::new()?;
        if !folder.overlaps(self.start, self.end) {
            return Ok(());
        }
        let (start, end) = (u32::from(self.start), u32::from(self.end));
        for cp in (start..=end).filter_map(char::from_u32) {
            for &cp_folded in folder.mapping(cp) {
                ranges.push(ClassUnicodeRange::new(cp_folded, cp_folded));
            }
        }
        Ok(())
    }

And here's the case folder:

/// A state oriented traverser of the simple case folding table.
///
/// A case folder can be constructed via `SimpleCaseFolder::new()`, which will
/// return an error if the underlying case folding table is unavailable.
///
/// After construction, it is expected that callers will use
/// `SimpleCaseFolder::mapping` by calling it with codepoints in strictly
/// increasing order. For example, calling it on `b` and then on `a` is illegal
/// and will result in a panic.
///
/// The main idea of this type is that it tries hard to make mapping lookups
/// fast by exploiting the structure of the underlying table, and the ordering
/// assumption enables this.
#[derive(Debug)]
pub struct SimpleCaseFolder {
    /// The simple case fold table. It's a sorted association list, where the
    /// keys are Unicode scalar values and the values are the corresponding
    /// equivalence class (not including the key) of the "simple" case folded
    /// Unicode scalar values.
    table: &'static [(char, &'static [char])],
    /// The last codepoint that was used for a lookup.
    last: Option<char>,
    /// The index to the entry in `table` corresponding to the smallest key `k`
    /// such that `k > k0`, where `k0` is the most recent key lookup. Note that
    /// in particular, `k0` may not be in the table!
    next: usize,
}

impl SimpleCaseFolder {
    /// Create a new simple case folder, returning an error if the underlying
    /// case folding table is unavailable.
    pub fn new() -> Result<SimpleCaseFolder, CaseFoldError> {
        #[cfg(not(feature = "unicode-case"))]
        {
            Err(CaseFoldError(()))
        }
        #[cfg(feature = "unicode-case")]
        {
            Ok(SimpleCaseFolder {
                table: crate::unicode_tables::case_folding_simple::CASE_FOLDING_SIMPLE,
                last: None,
                next: 0,
            })
        }
    }

    /// Return the equivalence class of case folded codepoints for the given
    /// codepoint. The equivalence class returned never includes the codepoint
    /// given. If the given codepoint has no case folded codepoints (i.e.,
    /// no entry in the underlying case folding table), then this returns an
    /// empty slice.
    ///
    /// # Panics
    ///
    /// This panics when called with a `c` that is less than or equal to the
    /// previous call. In other words, callers need to use this method with
    /// strictly increasing values of `c`.
    pub fn mapping(&mut self, c: char) -> &'static [char] {
        if let Some(last) = self.last {
            assert!(
                last < c,
                "got codepoint U+{:X} which occurs before \
                 last codepoint U+{:X}",
                u32::from(c),
                u32::from(last),
            );
        }
        self.last = Some(c);
        if self.next >= self.table.len() {
            return &[];
        }
        let (k, v) = self.table[self.next];
        if k == c {
            self.next += 1;
            return v;
        }
        match self.get(c) {
            Err(i) => {
                self.next = i;
                &[]
            }
            Ok(i) => {
                // Since we require lookups to proceed
                // in order, anything we find should be
                // after whatever we thought might be
                // next. Otherwise, the caller is either
                // going out of order or we would have
                // found our next key at 'self.next'.
                assert!(i > self.next);
                self.next = i + 1;
                self.table[i].1
            }
        }
    }

    /// Returns true if and only if the given range overlaps with any region
    /// of the underlying case folding table. That is, when true, there exists
    /// at least one codepoint in the inclusive range `[start, end]` that has
    /// a non-trivial equivalence class of case folded codepoints. Conversely,
    /// when this returns false, all codepoints in the range `[start, end]`
    /// correspond to the trivial equivalence class of case folded codepoints,
    /// i.e., itself.
    ///
    /// This is useful to call before iterating over the codepoints in the
    /// range and looking up the mapping for each. If you know none of the
    /// mappings will return anything, then you might be able to skip doing it
    /// altogether.
    ///
    /// # Panics
    ///
    /// This panics when `end < start`.
    pub fn overlaps(&self, start: char, end: char) -> bool {
        use core::cmp::Ordering;

        assert!(start <= end);
        self.table
            .binary_search_by(|&(c, _)| {
                if start <= c && c <= end {
                    Ordering::Equal
                } else if c > end {
                    Ordering::Greater
                } else {
                    Ordering::Less
                }
            })
            .is_ok()
    }

    /// Returns the index at which `c` occurs in the simple case fold table. If
    /// `c` does not occur, then this returns an `i` such that `table[i-1].0 <
    /// c` and `table[i].0 > c`.
    fn get(&self, c: char) -> Result<usize, usize> {
        self.table.binary_search_by_key(&c, |&(c1, _)| c1)
    }
}

I think this is still the same fundamental idea in your patch, where basically it keeps track of the last position and looks for an adjacent match before launching into binary search.

I didn't do the "Removes unnecessary iterations" optimization though.

Mar 06 '23 03:03 BurntSushi

Awesome! I will close this PR in favour of these changes. It appears that the changes do dramatically improve the performance, but not quite at the ratio presented originally:

$ time cargo run --release --example test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/examples/test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'`

real    0m7,437s
user    0m7,416s
sys     0m0,021s

Becomes:

$ time cargo run --release --example test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/examples/test-pattern '(?i:[[:^space:]------------------------------------------------------------------------])'`

real    0m0,087s
user    0m0,060s
sys     0m0,024s

(perhaps not the most scientific result, but I do not have the original machine I tested on anymore :slightly_smiling_face:)

Mar 06 '23 13:03 addisoncrump

Ah no I'd like to leave this PR open. It should get closed automatically once my own changes land.

That new ratio looks good to me.

We can continue to improve here too. With the new case folder, I think there's more room to optimize it, including with your additional trick.

Mar 06 '23 13:03 BurntSushi