node-re2 icon indicating copy to clipboard operation
node-re2 copied to clipboard

Exposing RE2::Set

Open Exter-N opened this issue 7 years ago • 11 comments

The RE2::Set class allows to match a string against several regular expressions in a single pass, seemingly more efficiently than piping all the regexes together : https://github.com/google/re2/blob/f2cc1aeb5de463c45d020c446cbcb028385b49f3/re2/set.h#L21-L23

It could be exposed to JavaScript code by :

  • accepting an Array of patterns as the first parameter of the constructor ;
  • adding functionality to the matching methods to use RE2::Set to identify which one of the patterns matches (it doesn't seem to go further than that), and then use regular RE2s corresponding to the identified patterns to get more information ;
  • in .exec(), returning the index of the pattern which matched and/or the pattern itself as properties of the returned array (maybe with Symbol keys, to eliminate the risks of name collisions with future properties that could be defined by the ECMAScript spec) ;
  • returning the piped patterns in the source property, for compatibility ;
  • exposing a new sources property (or a property with a Symbol key) containing the individual patterns ;
  • either returning an array of the translated patterns in the internalSource property, or applying the same process as for the source property.

Use cases could be optimizing anything that boils down to this kind of code :

let match;
if ((match = re1.exec(str)) !== null) {
    // …
} else if ((match = re2.exec(str)) !== null) {
    // …
} else if ((match = re3.exec(str)) !== null) {
    // …
} /* potentially lots of other cases */ else {
    // …
}

For example, a HTTP router, a lexer …

What do you think about such a feature?

Exter-N avatar Jun 06 '18 21:06 Exter-N

Sounds useful. We can implement it as a custom RE2-specific feature.

Points to clarify:

  1. Super simple JavaScript API.
    // possible sketch
    
    // creating
    var iterable = [/1/, /2/];
    const reSet = new RE2.Set(iterable);
    // any iterable should do
    // the set is created and compiled
    // I don't think it make sense to add regular expressions to the set dynamically
    
    // using
    reSet.test(str);
    reSet.exec(str);
    
  2. (nice to have) A benchmark to show, if this feature is faster than: a) A bunch of RE2 objects. b) A bunch of RegExp objects.

Aside

I have a project, which was trying to generate an optimal lexer automatically: https://github.com/uhop/parser-toolkit

One optimization I was trying to do is converting simple matchers into one to increase the overall performance. For example, I have two matchers: /\d+/ and /\w+/, so one way to combine them into something useable would be like that:

/^(\d+)|(\w+)/

^ is obviously to establish an anchor, which can be done differently with a sticky bit. In reality, my chains of matchers were much longer.

The technique worked and produced a long list of possible matches, all but one of them are empty. And the only way to find out the matched group was a linear search with its n/2 complexity. It would be nice to have a way to find a match and know what match actually happened. I don't know if RE2::Set provides a solution to this problem. It sure would be nice to solve something like that.

uhop avatar Jun 07 '18 21:06 uhop

The API I was thinking of was in my opinion even simpler, something like :

let re = new RE2("pattern", "flags"); // no change compared to the current version
re.exec("input"); // no change either, the same goes for .test, .match, .replace, .search, .split, .toString, .source, .flags, …
re.sources; // undefined

let reSet = new RE2([ "in", "put", "some other pattern", … ], "gi");
// assume reSet.lastIndex = 0; before each of the following lines
reSet.exec("input"); // [ "in", index: 0, groups: undefined, input: "input", patternIndex: 0, pattern: "in" ]
reSet.test("input"); // true
"input".match(reSet); // same as 2 lines above
"input".search(reSet); // 0
"input".split(reSet); // [ "", "", "" ]
"input".replace(reSet, () => "X"); // "XX"
"input".replace(reSet, [ () => "X", () => "Y" ]); // "XY"
reSet.flags; // "gi"
reSet.source; // "in|put|some other pattern|…"
reSet.sources; // [ "in", "put", "some other pattern", … ]
reSet.toString(); // "/in|put|some other pattern|…/gi"

Exter-N avatar Jun 09 '18 23:06 Exter-N

As for determining the matched pattern, if I understood the RE2::Set API correctly, it returns a list of the matched patterns (but no more information, so the match needs to be re-run with the individual patterns to get position/group info). I'll have to run some tests to determine the performance cost of all this. By the way, do you want to use a benchmark framework/runner in particular ?

Exter-N avatar Jun 10 '18 03:06 Exter-N

By the way, do you want to use a benchmark framework/runner in particular?

Because it is an internal deal, not published anywhere, I don't really care as long as it is reliable.

uhop avatar Jun 10 '18 16:06 uhop

Writing an adapter which enables one to write for (Local<Value> element: object) { in C++ and have the same behavior as a JS for (let element of object) { is an … interesting exercise. 🤔

Anyway, I'm progressing, step after step, when I find time for it.

Exter-N avatar Jun 20 '18 21:06 Exter-N

Looking into this RE2 option, it seems less useful: apparently it is used only as extension of test(). It means it cannot return matches, groups, and so on. Just a Boolean value. I am closing this ticket until we have a definitive decision and its execution plan.

uhop avatar Jul 12 '22 16:07 uhop

Using Set is a 40x performance improvement for our use when we need to test a string against multiple (100+) regexes.

Would suggest something like the following:

const set = new RE2Set(patterns)
for (const idx of set.test('asdasdasd')) {
  console.log(pattern[idx], 'match!')
}

ronag avatar Sep 14 '24 07:09 ronag

40x is a compelling number. I guess it makes this feature back in the game.

uhop avatar Sep 15 '24 15:09 uhop

Similar benefit here.

wrmedford avatar Nov 23 '25 18:11 wrmedford

Thank you, @wrmedford. I'll take a look at the PR and play with it a bit.

In unrelated news: I've noticed that I forgot to create a ticket for re2::FilteredRE2. Here you go: #232. Care to take a look and see if this is something you want to try your hand at implementing? 😉

uhop avatar Nov 24 '25 00:11 uhop

I can definitely take a look. This one is tied to a ticket I have at work so it might take me a minute to loop back around :)

wrmedford avatar Nov 24 '25 00:11 wrmedford