Exposing RE2::Set
The RE2::Set class allows to match a string against several regular expressions in a single pass, seemingly more efficiently than piping all the regexes together :
https://github.com/google/re2/blob/f2cc1aeb5de463c45d020c446cbcb028385b49f3/re2/set.h#L21-L23
It could be exposed to JavaScript code by :
- accepting an
Arrayof patterns as the first parameter of the constructor ; - adding functionality to the matching methods to use
RE2::Setto identify which one of the patterns matches (it doesn't seem to go further than that), and then use regularRE2s corresponding to the identified patterns to get more information ; - in
.exec(), returning the index of the pattern which matched and/or the pattern itself as properties of the returned array (maybe withSymbolkeys, to eliminate the risks of name collisions with future properties that could be defined by the ECMAScript spec) ; - returning the piped patterns in the
sourceproperty, for compatibility ; - exposing a new
sourcesproperty (or a property with aSymbolkey) containing the individual patterns ; - either returning an array of the translated patterns in the
internalSourceproperty, or applying the same process as for thesourceproperty.
Use cases could be optimizing anything that boils down to this kind of code :
let match;
if ((match = re1.exec(str)) !== null) {
// …
} else if ((match = re2.exec(str)) !== null) {
// …
} else if ((match = re3.exec(str)) !== null) {
// …
} /* potentially lots of other cases */ else {
// …
}
For example, a HTTP router, a lexer …
What do you think about such a feature?
Sounds useful. We can implement it as a custom RE2-specific feature.
Points to clarify:
- Super simple JavaScript API.
// possible sketch // creating var iterable = [/1/, /2/]; const reSet = new RE2.Set(iterable); // any iterable should do // the set is created and compiled // I don't think it make sense to add regular expressions to the set dynamically // using reSet.test(str); reSet.exec(str); - (nice to have) A benchmark to show, if this feature is faster than:
a) A bunch of
RE2objects. b) A bunch ofRegExpobjects.
Aside
I have a project, which was trying to generate an optimal lexer automatically: https://github.com/uhop/parser-toolkit
One optimization I was trying to do is converting simple matchers into one to increase the overall performance. For example, I have two matchers: /\d+/ and /\w+/, so one way to combine them into something useable would be like that:
/^(\d+)|(\w+)/
^ is obviously to establish an anchor, which can be done differently with a sticky bit. In reality, my chains of matchers were much longer.
The technique worked and produced a long list of possible matches, all but one of them are empty. And the only way to find out the matched group was a linear search with its n/2 complexity. It would be nice to have a way to find a match and know what match actually happened. I don't know if RE2::Set provides a solution to this problem. It sure would be nice to solve something like that.
The API I was thinking of was in my opinion even simpler, something like :
let re = new RE2("pattern", "flags"); // no change compared to the current version
re.exec("input"); // no change either, the same goes for .test, .match, .replace, .search, .split, .toString, .source, .flags, …
re.sources; // undefined
let reSet = new RE2([ "in", "put", "some other pattern", … ], "gi");
// assume reSet.lastIndex = 0; before each of the following lines
reSet.exec("input"); // [ "in", index: 0, groups: undefined, input: "input", patternIndex: 0, pattern: "in" ]
reSet.test("input"); // true
"input".match(reSet); // same as 2 lines above
"input".search(reSet); // 0
"input".split(reSet); // [ "", "", "" ]
"input".replace(reSet, () => "X"); // "XX"
"input".replace(reSet, [ () => "X", () => "Y" ]); // "XY"
reSet.flags; // "gi"
reSet.source; // "in|put|some other pattern|…"
reSet.sources; // [ "in", "put", "some other pattern", … ]
reSet.toString(); // "/in|put|some other pattern|…/gi"
As for determining the matched pattern, if I understood the RE2::Set API correctly, it returns a list of the matched patterns (but no more information, so the match needs to be re-run with the individual patterns to get position/group info). I'll have to run some tests to determine the performance cost of all this. By the way, do you want to use a benchmark framework/runner in particular ?
By the way, do you want to use a benchmark framework/runner in particular?
Because it is an internal deal, not published anywhere, I don't really care as long as it is reliable.
Writing an adapter which enables one to write for (Local<Value> element: object) { in C++ and have the same behavior as a JS for (let element of object) { is an … interesting exercise. 🤔
Anyway, I'm progressing, step after step, when I find time for it.
Looking into this RE2 option, it seems less useful: apparently it is used only as extension of test(). It means it cannot return matches, groups, and so on. Just a Boolean value. I am closing this ticket until we have a definitive decision and its execution plan.
Using Set is a 40x performance improvement for our use when we need to test a string against multiple (100+) regexes.
Would suggest something like the following:
const set = new RE2Set(patterns)
for (const idx of set.test('asdasdasd')) {
console.log(pattern[idx], 'match!')
}
40x is a compelling number. I guess it makes this feature back in the game.
Similar benefit here.
Thank you, @wrmedford. I'll take a look at the PR and play with it a bit.
In unrelated news: I've noticed that I forgot to create a ticket for re2::FilteredRE2. Here you go: #232. Care to take a look and see if this is something you want to try your hand at implementing? 😉
I can definitely take a look. This one is tied to a ticket I have at work so it might take me a minute to loop back around :)