IronJS

The .NET regular expression engine's capturing behavior is not the same as the ECMAScript standard.

Open otac0n opened this issue 14 years ago • 7 comments

For regular expressions such as this: ((a+)?(b+)?c+)*

There are 3 capturing groups (one for each left-parenthesis).

If this is matched against a string like the following: bbbccaac

The .NET implementation will list the following capture groups:

((a+)?(b+)?c+) = "aac"
(a+) = "aa"
(b+) = "bbb"

Whereas the ECMAScript spec specifies the following capturing behavior:

((a+)?(b+)?c+) = "aac"
(a+) = "aa"
(b+) = undefined

The .NET implementation gives no indication that the (b+) capturing group did not participate in its most recent match attempt.
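For reference, the ECMAScript side of this comparison can be checked in any spec-compliant JavaScript engine. The sketch below uses the pattern and input from this report. The `groups` object and its property names are only illustrative labels, not part of any API.

```javascript
// Pattern and input taken from the report above.
const pattern = /((a+)?(b+)?c+)*/;
const input = "bbbccaac";

const match = pattern.exec(input);

// match[0] is the full match; match[1..3] are the three capturing groups.
// The property names below are illustrative labels only.
const groups = {
  outer: match[1], // ((a+)?(b+)?c+)
  aPlus: match[2], // (a+)
  bPlus: match[3], // (b+)
};

console.log(match[0]);     // "bbbccaac"
console.log(groups.outer); // "aac"
console.log(groups.aPlus); // "aa"
console.log(groups.bPlus); // undefined: reset when the outer group repeated
```

The last line is the point of difference: on each repetition of the outer group, ECMAScript resets the captures nested inside it, so (b+) ends up undefined instead of keeping "bbb" from the first iteration.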

otac0n avatar Apr 02 '11 14:04 otac0n

Does using RegexOptions.ECMAScript help?

http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexoptions(v=VS.100).aspx

hakanson avatar Apr 02 '11 20:04 hakanson

@hakanson: We are already using the ECMAScript option, which works well for the most part. It is just this little piece that is different.

otac0n avatar Apr 02 '11 21:04 otac0n

I think this is something we'll have to live with for now. Writing a custom regular expression implementation for this small detail is too much work for too little gain at the moment. I'll leave the ticket open, and we'll look into it eventually.

fholm avatar Apr 03 '11 10:04 fholm

-1 to me for not looking at the code in Core.fs:

    let options = (options ||| RegexOptions.ECMAScript) &&& ~~~RegexOptions.Compiled
    let key = (options, pattern)
    this.RegExp <- env.RegExpCache.Lookup key (fun () -> new Regex(pattern, options ||| RegexOptions.Compiled))

I'm new to F#; does this mean you are implementing your own compiled RegExp cache? I ask because there is a Regex.CacheSize property that controls an internal cache of compiled regular expressions. I assume having your own cache gave you more control, but I thought I would mention it for completeness (at the risk of looking uninformed a second time on the same issue).

http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.cachesize.aspx

hakanson avatar Apr 24 '11 23:04 hakanson

Yes, we do maintain our own regexp cache; we found it to be faster, actually.

fholm avatar Apr 24 '11 23:04 fholm

We found that in a loop like this...

while (true)
{
    var r = new RegExp("...");
}

...that .NET's regex cache was not helping.

When we implemented the regexp cache shown above, we saw a 50% reduction in the time on the SunSpider regexp test.
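As a rough illustration of the idea only (not the IronJS implementation, which is the F# shown earlier), a lookup-or-create cache keyed on flags and pattern might look like the sketch below. The names `regExpCache` and `lookupRegExp` are hypothetical.

```javascript
// Hypothetical cache, loosely mirroring the (options, pattern) key from Core.fs.
const regExpCache = new Map();

// Returns a cached RegExp for (pattern, flags), constructing it on first use.
function lookupRegExp(pattern, flags = "") {
  // Flags are letters only, so "/" cannot appear in them and cleanly
  // separates the two parts of the key.
  const key = `${flags}/${pattern}`;
  let cached = regExpCache.get(key);
  if (cached === undefined) {
    cached = new RegExp(pattern, flags);
    regExpCache.set(key, cached);
  }
  return cached;
}

// Repeated lookups with the same pattern and flags reuse one compiled object.
const first = lookupRegExp("a+b", "i");
const second = lookupRegExp("a+b", "i");
console.log(first === second); // true
```

One caveat with this simplified version: a JavaScript RegExp created with the g or y flag carries a mutable lastIndex, so sharing cached instances between callers would need extra care.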

otac0n avatar Apr 25 '11 00:04 otac0n

@otac0n - From the looks of it, the BCL only caches for static methods on the Regex object, so the increase in performance makes sense.

ChaosPandion avatar Apr 25 '11 00:04 ChaosPandion