Increased memory usage when updating to regex 1.10

Open Marwes opened this issue 1 year ago • 0 comments

What version of regex are you using?

1.10; I used 1.7 before. The issue seems to be mainly due to the rewrite in 1.9.

Describe the bug at a high level.

After updating to regex 1.10 I am seeing greatly increased memory usage (measured with the dhat crate; see the example below). In particular, part of the issue seems to be due to the use of capture groups in the regex. These captures only serve to group the regex, so they could (and should) be non-capturing groups, and I have fixed this on my end. However, since captures do not seem to matter on 1.7, I guess there may be a missed optimization here? (https://github.com/rust-lang/regex/issues/1059 comes to mind.)

(The regex in the example has been altered but it remains the same in spirit and exhibits the same memory increase)

What are the steps to reproduce the behavior?

The following code can be used to reproduce the behavior by using dhat to track memory and changing the regex version.

// Cargo.toml
// regex = "=1.10"
// dhat = "0.3"

#[global_allocator]
static ALLOC: dhat::Alloc = dhat::Alloc;

fn main() {
    without_captures();
    with_captures();
}

fn without_captures() {
    let _profiler = dhat::Profiler::builder().testing().build();

    let regex = r#"(?ux-mUis)
        (?:craigslist\.org$)|
        (?:utexas\.edu$)|
        (?:blogs\.com$)|
        (?:is\.gd$)|
        (?:vkontakte\.ru$)|
        (?:google\.com\.hk$)|
        (?:vimeo\.com$)|
        (?:simplemachines\.org$)|
        (?:plala\.or\.jp$)|
        (?:npr\.org$)|
        (?:census\.gov$)|
        (?:360\.cn$)|
        (?:wisc\.edu$)|
        (?:princeton\.edu$)|
        (?:addthis\.com$)|
        (?:google\.de$)|
        (?:ox\.ac\.uk$)|
        (?:free13runpool\.com$)|
        (?:berkeley\.edu$)|
        (?:fda\.gov$)|
        (?:soundcloud\.com$)|
        (?:ftc\.gov$)|
        (?:cloudflare\.com$)|
        (?:com\.com$)|
        (?:statcounter\.com$)|
        (?:tumblr\.com$)|
        (?:alexa\.com$)|
        (?:canalblog\.com$)|
        (?:uiuc\.edu$)|
        (?:msu\.edu$)|
        (?:bravesites\.com$)|
        (?:usatoday\.com$)|
        (?:edublogs\.org$)|
        (?:forbes\.com$)|
        (?:patch\.com$)|
        (?:1688\.com$)|
        (?:ihg\.com$)|
        (?:ow\.ly$)|
        (?:usda\.gov$)|
        (?:yellowbook\.com$)|
        (?:wired\.com$)|
        (?:homestead\.com$)|
        (?:state\.tx\.us$)|
        (?:webnode\.com$)|
        (?:123-reg\.co\.uk$)|
        (?:irs\.gov$)|
        (?:yale\.edu$)|
        (?:naver\.com$)|
        (?:elpais\.com$)|
        (?:example\.com$)
    "#;

    let regex = regex::Regex::new(regex).unwrap();

    let m = regex.is_match("webnode.com");
    eprintln!("Match `{m}`, with captures: {:#?}", dhat::HeapStats::get());
}

fn with_captures() {
    let _profiler = dhat::Profiler::builder().testing().build();

    let regex = r#"(?ux-mUis)
        (craigslist\.org$)|
        (utexas\.edu$)|
        (blogs\.com$)|
        (is\.gd$)|
        (vkontakte\.ru$)|
        (google\.com\.hk$)|
        (vimeo\.com$)|
        (simplemachines\.org$)|
        (plala\.or\.jp$)|
        (npr\.org$)|
        (census\.gov$)|
        (360\.cn$)|
        (wisc\.edu$)|
        (princeton\.edu$)|
        (addthis\.com$)|
        (google\.de$)|
        (ox\.ac\.uk$)|
        (free13runpool\.com$)|
        (berkeley\.edu$)|
        (fda\.gov$)|
        (soundcloud\.com$)|
        (ftc\.gov$)|
        (cloudflare\.com$)|
        (com\.com$)|
        (statcounter\.com$)|
        (tumblr\.com$)|
        (alexa\.com$)|
        (canalblog\.com$)|
        (uiuc\.edu$)|
        (msu\.edu$)|
        (bravesites\.com$)|
        (usatoday\.com$)|
        (edublogs\.org$)|
        (forbes\.com$)|
        (patch\.com$)|
        (1688\.com$)|
        (ihg\.com$)|
        (ow\.ly$)|
        (usda\.gov$)|
        (yellowbook\.com$)|
        (wired\.com$)|
        (homestead\.com$)|
        (state\.tx\.us$)|
        (webnode\.com$)|
        (123-reg\.co\.uk$)|
        (irs\.gov$)|
        (yale\.edu$)|
        (naver\.com$)|
        (elpais\.com$)|
        (example\.com$)
    "#;

    let regex = regex::Regex::new(regex).unwrap();

    let m = regex.is_match("webnode.com");
    eprintln!("Match `{m}`, with captures: {:#?}", dhat::HeapStats::get());
}
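
The two patterns above differ only in whether each alternative is wrapped in `(...)` or `(?:...)`. As a minimal sketch of that workaround, the helper below rewrites grouping-only parentheses mechanically. The name `to_non_capturing` is hypothetical and it uses only the standard library. It assumes the pattern has no `]` as the first character of a character class, no nested classes, and no parentheses inside `x`-mode comments. It is not part of the regex crate and not necessarily how the change was made in the original code.

```rust
/// Rewrites plain capturing groups `(...)` as non-capturing groups `(?:...)`.
///
/// Assumptions of this sketch: a `(` that is escaped with a backslash or that
/// sits inside a character class `[...]` is a literal and is left alone. Any
/// group that already starts with `(?` (flags, named groups, non-capturing
/// groups) is also left unchanged, so named groups stay capturing.
fn to_non_capturing(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len() + 16);
    let mut chars = pattern.chars().peekable();
    let mut in_class = false;

    while let Some(c) = chars.next() {
        out.push(c);
        match c {
            // Copy the escaped character verbatim so `\(` stays a literal.
            '\\' => {
                if let Some(next) = chars.next() {
                    out.push(next);
                }
            }
            '[' if !in_class => in_class = true,
            ']' if in_class => in_class = false,
            // A group opener outside a class: add `?:` unless it already
            // begins with `?`.
            '(' if !in_class => {
                if chars.peek() != Some(&'?') {
                    out.push_str("?:");
                }
            }
            _ => {}
        }
    }
    out
}

fn main() {
    // A shortened version of the capturing pattern from the example.
    let capturing = r"(?ux-mUis)(webnode\.com$)|(example\.com$)";
    println!("{}", to_non_capturing(capturing));
}
```

A rewrite like this is only safe when no caller reads the numbered groups, as in the example, which calls `is_match` only.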

Memory stats from running the example

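Most of the stats are similar, but with capturing groups on 1.10.2 max_bytes is roughly 5x higher (about 1.24 MB, versus about 0.25 MB on 1.7.3 and about 0.23 MB without captures on 1.10.2). In each pair below, the first block comes from without_captures() and the second from with_captures(); both print the label "with captures" because the example uses the same eprintln! message in both functions.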

https://docs.rs/dhat/latest/dhat/struct.HeapStats.html

1.7.3

Match `true`, with captures: HeapStats {
    total_blocks: 4137,
    total_bytes: 1189678,
    curr_blocks: 48,
    curr_bytes: 114285,
    max_blocks: 212,
    max_bytes: 247538,
}
Match `true`, with captures: HeapStats {
    total_blocks: 4152,
    total_bytes: 1201606,
    curr_blocks: 48,
    curr_bytes: 121921,
    max_blocks: 212,
    max_bytes: 247338,
}

1.10.2


Match `true`, with captures: HeapStats {
    total_blocks: 3486,
    total_bytes: 763125,
    curr_blocks: 221,
    curr_bytes: 160832,
    max_blocks: 1215,
    max_bytes: 228249,
}
Match `true`, with captures: HeapStats {
    total_blocks: 3694,
    total_bytes: 1871135,
    curr_blocks: 221,
    curr_bytes: 1242544,
    max_blocks: 216,
    max_bytes: 1242568,
}
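
To make the size of the increase concrete, the following sketch computes growth factors from the `max_bytes` figures reported above. The helper name `growth_factor` is hypothetical. The sketch assumes the first block of each pair is the `without_captures()` run, since `main` calls that function first.

```rust
/// Growth factor between two byte counts, as a floating-point ratio.
fn growth_factor(before: u64, after: u64) -> f64 {
    after as f64 / before as f64
}

fn main() {
    // max_bytes values copied from the HeapStats output above.
    let with_caps_1_7_3: u64 = 247_338;
    let without_caps_1_10_2: u64 = 228_249;
    let with_caps_1_10_2: u64 = 1_242_568;

    // Same pattern with captures, old version vs. new version.
    println!(
        "1.7.3 -> 1.10.2 (with captures): {:.2}x",
        growth_factor(with_caps_1_7_3, with_caps_1_10_2)
    );
    // Same version, non-capturing vs. capturing groups.
    println!(
        "1.10.2 without -> with captures: {:.2}x",
        growth_factor(without_caps_1_10_2, with_caps_1_10_2)
    );
}
```

Both ratios come out a little above 5, which matches the "5x" figure quoted for the capturing-group case.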

Oct 27 '23 14:10 Marwes
