phobos icon indicating copy to clipboard operation
phobos copied to clipboard

std.regex postprocesses ctRegex every time at runtime

Open dlangBugzillaToGithub opened this issue 9 years ago • 3 comments

greensunny12 reported this on 2016-08-31T23:32:56Z

Transfered from https://issues.dlang.org/show_bug.cgi?id=16457

CC List

  • dmitry.olsh (@DmitryOlshansky)
  • greeenify

Description

Consider the following program, it will crash as it allocates _a lot_ of memory before the garbage can be collected. 

void main()
{
    import std.regex;
    enum re = ctRegex!(`((c)(s)?)?ti`);
    import core.memory : GC;
    string text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit";

    foreach (i; 0..500_000_000)
    {
        if (auto m = matchAll(text, re)) {}
        //if (i % 1_000_000)
            //GC.collect();
    }
}

On my machine (16G) it crashes at about 5M iterations.
The GC profile finds two hotspots (here 2M iterations): 

     2048000000	        2000000	uint[] D main std/regex/internal/parser.d:1607
      184000000	        2000000	std.regex.internal.ir.Bytecode[] D main std/array.d:852

(the latter is insertInPlace)

After looking at the code it seems pretty weird, because "postprocess" should be called while constructing the RegEx and once only.

dlangBugzillaToGithub avatar Aug 31 '16 23:08 dlangBugzillaToGithub

dmitry.olsh (@DmitryOlshansky) commented on 2016-09-03T19:51:57Z

(In reply to greensunny12 from comment #0)
> Consider the following program, it will crash as it allocates _a lot_ of
> memory before the garbage can be collected. 
> 
> void main()
> {
>     import std.regex;
>     enum re = ctRegex!(`((c)(s)?)?ti`);
>     import core.memory : GC;
>     string text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit";
> 
>     foreach (i; 0..500_000_000)
>     {
>         if (auto m = matchAll(text, re)) {}
>         //if (i % 1_000_000)
>             //GC.collect();
>     }
> }
> 
> On my machine (16G) it crashes at about 5M iterations.
> The GC profile finds two hotspots (here 2M iterations): 
> 
>      2048000000	        2000000	uint[] D main
> std/regex/internal/parser.d:1607
>       184000000	        2000000	std.regex.internal.ir.Bytecode[] D main
> std/array.d:852
> 
> (the latter is insertInPlace)
> 
> After looking at the code it seems pretty weird, because "postprocess"
> should be called while constructing the RegEx and once only.

The problem is enum re = ... line. Enum means ctRegex!`((c)(s)?)?ti` is copy-pasted at the place of usage, basically reconstructing regex on each loop iteration b/c compile-time version can't cache compiled patterns.

Replace enum with static and all should be good.

dlangBugzillaToGithub avatar Sep 03 '16 19:09 dlangBugzillaToGithub

greeenify commented on 2016-09-03T20:08:39Z

thanks a lot for looking at this so quickly :)

> The problem is enum re = ... line. Enum means ctRegex!`((c)(s)?)?ti` is copy-pasted at the place of usage, basically reconstructing regex on each loop iteration b/c compile-time version can't cache compiled patterns.

Okay that makes partially sense, but to give some background I found this problem because the D changelog generation crashed on my machine (16G) while parsing a 300kB log file. The dummy file is a dustmited & reduced version of the changelog generator:

https://github.com/dlang/tools/blob/master/changed.d#L77

And using `enum ... = ctRegex!...` seems to be a very common pattern:

https://github.com/search?q=enum+ctRegex&type=Code&utf8=%E2%9C%93

> Replace enum with static and all should be good.

Unfortunately I can still observe the same behavior, I also tried to move ctRegex to the matchAll.
Funnily with regex I don't get this allocation / out of memory -problem.

FYI static immutable doens't work:

foo.d(10): Error: template std.regex.matchAll cannot deduce function from argument types !()(string, immutable(StaticRegex!char)), candidates are:
/usr/include/dlang/dmd/std/regex/package.d(860):        std.regex.matchAll(R, RegEx)(R input, RegEx re) if (isSomeString!R && is(RegEx == Regex!(BasicElementOf!R)))
/usr/include/dlang/dmd/std/regex/package.d(868):        std.regex.matchAll(R, String)(R input, String re) if (isSomeString!R && isSomeString!String)
/usr/include/dlang/dmd/std/regex/package.d(875):        std.regex.matchAll(R, RegEx)(R input, RegEx re) if (isSomeString!R && is(RegEx == StaticRegex!(BasicElementOf!R)))

dlangBugzillaToGithub avatar Sep 03 '16 20:09 dlangBugzillaToGithub

dlang-bugzilla (@CyberShadow) commented on 2017-07-21T09:00:44Z

I think the obvious reasonable improvement here is to improve the documentation, and mention to not use ctRegex with enum. Not sure much can be done in Phobos to work around an underlying language / compiler implementation issue.

dlangBugzillaToGithub avatar Jul 21 '17 09:07 dlangBugzillaToGithub