quamina icon indicating copy to clipboard operation
quamina copied to clipboard

Consider using RE2 (aka golang `regexp`) for wildcard matches

Open evankanderson opened this issue 2 years ago • 4 comments

You might be interested to know that the Go regexp library uses finite automata already to implement the regexp package. From that package:

The regexp implementation provided by this package is guaranteed to run in time linear in the size of the input. ... For more information about this property, see

https://swtch.com/~rsc/regexp/regexp1.html

It should be trivial(1) to replace the current shellstyle matcher with a translation to the regexp module, and to check for performance differences.

(1) exercise left to the reader

The one place where shellstyle might be faster than regexp is if you require start and end matching and allow only one glob in the middle. At that point, shellstyle becomes a shortcut for startswith("stuff before *") AND endswith("stuff after *") AND len(string) > len(expression).

evankanderson avatar Jul 01 '22 19:07 evankanderson

// An RE2::Set represents a collection of regexps that can
// be searched for simultaneously.

  // Adds pattern to the set using the options passed to the constructor.
  // Returns the index that will identify the regexp in the output of Match(),
  // or -1 if the regexp cannot be parsed.
  // Indices are assigned in sequential order starting from 0.
  // Errors do not increment the index; if error is not NULL, *error will hold
  // the error message from the parser.
  int Add(const StringPiece& pattern, std::string* error);

Might be relevant? Source

jsmorph avatar Jul 01 '22 20:07 jsmorph

First of all, I believe it's not just a Go thing, more or less every regex library has finite automata inside, isn't that the only sane approach?

RE2::Set at one level sounds much like what Quamina implements. The interesting thing about Quamina is that you can add as many patterns as you want and it doesn't affect the performance very much. Another useful thing about Quamina is that each pattern you add has an identifier and the Match operation tells you which of the patterns you added matched.

Anyhow, I'll have a closer look at the Go regexp library. I wonder if it gives programmatic access to the DFA produced by compiling the regex? If so, that would be incredibly useful. If not, it still might be worthwhile forking regexp in order to steal that compiler.

Personally, over in the IETF JSONPATH Working Group we're working on IRegexp, https://datatracker.ietf.org/doc/draft-ietf-jsonpath-iregexp/ and that's the subset I'd like to try to support in Quamina.

timbray avatar Jul 01 '22 23:07 timbray

If you read the linked paper, you'll see that both pcre and Java have implemented features that require backtracking rather than a finite automata approach.

I mostly mentioned re2 (open source as well) and Go's regexp libraries because they tend to be well-benchmarked and performance tuned already. (And yes, matching should be roughly O(n) worst-case on the length of the input once compilation is complete.)

evankanderson avatar Jul 02 '22 08:07 evankanderson

Right, and of course having backtracking doesn't mean you're not using DFAs. Sometime I will share the funny/horrible story about how the Java regex implementation blew up JRuby on Rails. Anyhow, as noted, I'll take a close look as to whether we can get at Go regexp's automaton-building code. Of course, that software is aimed at answering the question “Does this regex match this string and where does it start?" while Quamina aims at the question “Which of the many patterns in this instance match this string?” but there still may be value to share.

timbray avatar Jul 02 '22 22:07 timbray