
unexpected perf results and questions

Open sarnold opened this issue 4 years ago • 3 comments

Since I need a good RE2 Python interface (and there appear to be many, largely unmaintained), I ended up testing this one and the Google one using your performance.py script. The results are somewhat unexpected compared to the performance table in the README: https://github.com/andreasvc/pyre2#performance

I made a small change to make it run with newer Python (return _wikidata.decode('utf8')), which may be why the results look odd; can you verify whether this is correct or not?

re2-perf-data.txt

sarnold avatar Dec 04 '20 20:12 sarnold

Working with Unicode adds overhead. If your use case lets you work with bytes, that is faster; apparently this is also what the performance script (which I didn't write) was benchmarking. To make the script work across Python 2 and 3 while keeping the best performance, you should probably use bytes. I don't know if this explains the unexpected results; let me know if you discover more. I don't have time to look into this myself, but if I did, I would investigate by profiling.
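As a rough illustration of the profiling suggestion, here is a minimal sketch using the standard library's cProfile. The helper names (benchmark, profile_call) are hypothetical and not part of performance.py, and the standard re module is used only as a stand-in; swapping the import for re2 is an assumption about how you would profile pyre2 itself.

```python
import cProfile
import io
import pstats
import re  # stand-in; replace with "import re2 as re" to profile pyre2 (assumption)


def benchmark(pattern, data, repeat=50):
    """Hypothetical workload: count all matches of pattern in data, repeat times."""
    compiled = re.compile(pattern)
    total = 0
    for _ in range(repeat):
        total += len(compiled.findall(data))
    return total


def profile_call(func, *args):
    """Run func under cProfile; return its result and a short text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the five entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    data = b"abc 123 def 4567 " * 100
    result, report = profile_call(benchmark, rb"\d+", data)
    print(result)
    print(report)
```

Running the same call once with a bytes pattern and once with a str pattern should show whether the time goes into matching or into conversion around it.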

I don't really understand what you are benchmarking; what exactly are "google-re2" and "py-re2"? Why would the performance of Python's re from the standard library differ between these two? I don't know if that's a meaningful difference, since it is supposed to be the baseline.

andreasvc avatar Dec 05 '20 20:12 andreasvc

Sorry if that wasn't clear; google-re2 is my cmake respin of the Google Python interface, which you can find here: https://github.com/freepn/google-re2 and py-re2 is my fork of your repo (I added the dash to avoid name clashes). I was hoping the Google one would be interface-compatible with pyre2/adblockparser, but it is not, so right now my only fallback is your pyre2.

sarnold avatar Dec 05 '20 22:12 sarnold

I see. I didn't know about the google-re2 Python bindings. Perhaps the README could explain the differences.

If performance is critical, then you should work with UTF-8 encoded byte strings. This is what RE2 uses internally. If you work with Python Unicode strings, there will be encoding and decoding on every pyre2 call. RE2 actually fully supports Unicode, even when you pass UTF-8 encoded byte strings.
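A minimal sketch of the difference, assuming pyre2 is importable as re2 (it falls back to the standard re module so the snippet still runs without it). The sample text and the helper names are made up for illustration, and the timings depend on your machine and data.

```python
import timeit

try:
    import re2 as regex  # assumption: pyre2 is installed under this name
except ImportError:
    import re as regex   # stand-in so the sketch runs without pyre2

# Made-up sample text containing non-ASCII characters.
TEXT = "naïve café, résumé, jalapeño " * 1000
DATA = TEXT.encode("utf-8")  # encode once, up front

STR_PATTERN = regex.compile("café")
BYTES_PATTERN = regex.compile("café".encode("utf-8"))


def count_str():
    # Unicode in, Unicode out: the binding has to convert on each call.
    return len(STR_PATTERN.findall(TEXT))


def count_bytes():
    # UTF-8 bytes in, bytes out: no per-call conversion needed.
    return len(BYTES_PATTERN.findall(DATA))


if __name__ == "__main__":
    print("str:  ", timeit.timeit(count_str, number=100))
    print("bytes:", timeit.timeit(count_bytes, number=100))
```

Both functions find the same matches; a bytes match only needs a .decode("utf-8") at the point where you actually want text.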

If your fork contains any useful improvements, you're welcome to submit a pull request.

andreasvc avatar Dec 06 '20 22:12 andreasvc