FerretDB
Support `$regex`'s `x` option
Cover with great tests and fuzzing.
Hi, I was studying the codebase a bit yesterday and I'm interested in this issue. Would you mind if I take it on?
Sure. Please make sure that you add good tests for that. I will try to help with fuzzing as soon as possible :)
There's not much about free-spacing mode on the internet. As far as I can tell, there are also small differences between engines, and some of them don't implement it at all...
My approach is to pass every regex that has the "x" flag through a function that translates the multi-line, free-spaced, commented regex into a string that Go's `regexp` package understands.
However, I'm not sure if my conditions correspond to the intended behavior.
Conditions that I'm not sure about:
- Spaces between `[ ]` shouldn't be removed from the string.
- Spaces between `{ }` shouldn't be removed either. (`a{10}` matches all text where `a` is repeated 10 times in a row, but `a{1 0}` just matches "a{1 0}" in the text.)
I have tried to find an answer in the MongoDB source code, but I'm still not sure about that. If anybody knows the answer, I would be more than happy to hear it. If something is not clear, let me know and I'll do my best to clarify.
Also, I understand that there's a lot of work to do on this project, so I will just try to implement it this way and continue to search.
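For illustration, the translation described in this comment could be sketched roughly as below. This is only a sketch under the assumptions stated in the comment (whitespace inside `[ ]` and `{ }` is kept), which are not confirmed MongoDB or PCRE behavior. The name `stripFreeSpacing` is hypothetical and is not FerretDB code, and the sketch ignores corner cases such as a literal `]` as the first character of a class.

```go
package main

import (
	"fmt"
	"strings"
)

// stripFreeSpacing is a hypothetical helper: it removes unescaped whitespace
// and #-comments from a free-spaced pattern. By assumption (taken from the
// comment above, not verified against MongoDB), text inside [ ] and { } is
// copied verbatim.
func stripFreeSpacing(pattern string) string {
	var b strings.Builder
	rs := []rune(pattern)
	inClass := false
	for i := 0; i < len(rs); i++ {
		r := rs[i]
		switch {
		case r == '\\' && i+1 < len(rs):
			// Keep an escaped character (including "\ ") untouched.
			b.WriteRune(r)
			b.WriteRune(rs[i+1])
			i++
		case inClass:
			// Inside a character class everything is significant.
			if r == ']' {
				inClass = false
			}
			b.WriteRune(r)
		case r == '[':
			inClass = true
			b.WriteRune(r)
		case r == '{':
			// Copy through the closing brace verbatim, if there is one.
			end := i
			for end < len(rs) && rs[end] != '}' {
				end++
			}
			if end == len(rs) {
				b.WriteRune(r) // no closing brace: keep a lone '{'
				continue
			}
			b.WriteString(string(rs[i : end+1]))
			i = end
		case r == '#':
			// Skip the comment up to (not including) the end of the line.
			for i+1 < len(rs) && rs[i+1] != '\n' {
				i++
			}
		case strings.ContainsRune(" \t\n\r\f\v", r):
			// Drop unescaped whitespace outside of classes and braces.
		default:
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	for _, p := range []string{"a b c", "[a b] c", "a{1 0} b", "a # comment\n b"} {
		fmt.Printf("%q -> %q\n", p, stripFreeSpacing(p))
	}
}
```

A character-by-character scan like this has to track context by hand, which is why the conditions for `[ ]` and `{ }` matter so much: each one is a separate branch that must agree with what MongoDB actually does.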
@noisersup would you please join slack? https://join.slack.com/t/ferretdb/shared_invite/zt-zqe9hj8g-ZcMG3~5Cs5u9uuOPnZB8~A
@seeforschauer I'm already there!
> Conditions that I'm not sure
In that case, we use integration tests to do the same thing as MongoDB. Please add a test there. See there for an overview of our testing.
Doing regexp pattern preprocessing as described above would require full-fledged regex parsing to avoid issues when working with `[]` and `{}`.
The Go standard `regexp/syntax` package is not a good fit here as it wasn't really made for external users. It creates an AST that is good for `regexp` package compilation, but otherwise it's hard to work with. For instance, it's hard to convert it back to a string (or use it to construct a new, preprocessed string).
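To make the AST point concrete, here is a small sketch that only inspects the tree `regexp/syntax` produces for the two patterns discussed earlier in the thread. `repeatBounds` is a hypothetical helper name. The sketch does not attempt the hard part (rebuilding a preprocessed pattern string), and it says nothing about how MongoDB or PCRE treat the same input.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// repeatBounds is a hypothetical helper: it parses pattern with Perl-style
// flags and reports the bounds if the top-level node is a counted
// repetition such as a{10}.
func repeatBounds(pattern string) (lo, hi int, ok bool) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil || re.Op != syntax.OpRepeat {
		return 0, 0, false
	}
	return re.Min, re.Max, true
}

func main() {
	for _, p := range []string{`a{10}`, `a{1 0}`} {
		lo, hi, ok := repeatBounds(p)
		fmt.Printf("%q: counted repetition=%v min=%d max=%d\n", p, ok, lo, hi)
	}
}
```

In Go's parser, `a{10}` comes back as a repetition node, while `a{1 0}` does not, which matches the observation above that the latter is matched as literal text by Go.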
I used this package to do regexp parsing in a couple of linters: https://github.com/quasilyte/regex/tree/master/syntax
It supports most of the PCRE syntax as well, which can be handy since MongoDB uses this dialect.