pyre2
pyre2 copied to clipboard
regex slowness
using latest pyre2 and RE2, when trying to proccess a big regex for webserver logs pythons re is far more quick than pyre2. weblog structure: '$domain $remote_addr $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$request_time" One can reproduce the bug by adding this on tests/performance.py:
def getweblogdata():
return open('/var/log/apache2/access.log') ##Mine is about 9mb
@register_test("weblog scan",
r'^(([\w\d]|[\w][\w\d\-]*[\w\d])\.)*([\w]|[\w][\w\d\-]*[\w\d])[\s]([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})[\s]*(.+)[\s]\[(.*)\][\s]"(GET|HEAD|POST|PUT|DELETE|OPTIONS)[\s](.*)[\s](HTTP.*)"[\s]([\d]*)[\s]([\d]*)[\s]"(.*)"[\s]"(.+)"[\s]"(.+)"',
1,
data=getweblogdata())
def weblog_matches(pattern, data):
"""
Match weblog data line by line.
"""
total=0
for line in data:
p = pattern.match(line)
if p==None:
continue
total += len(p.groups())
data.seek(0)
return total
test |Description |# total runs |re
time(s) |re2
time(s) | % regex
time
weblog scan search and extract weblog data. |1 | 8.678 | 94.259 |1086.13%
i cant understand if this is related to RE2 or pyre2 or my code.
I also have a performance problem.
I have three version of programs:
- pure Python using built-in re module
- Python using Pyre2
- C/C++ using re2
I measure two time: regex pre-compile time and process time.
#1 program: 1.7s 1.5s #2 program: 0.04s 4.4s #3 program: 0.02s 0.8s
They all feed by the same regex and input.
Then I follow the documentation about profiling in Cython. Got the following result:
ncalls tottime percall cumtime percall filename:lineno(function) 652884 16.477 0.000 25.349 0.000 re2.pyx:394(_search) 9479 6.059 0.001 41.806 0.004 export_plain.py:60(match) 652884 4.243 0.000 33.602 0.000 {method 'search' of 're2.Pattern' objects} 652884 4.010 0.000 29.359 0.000 re2.pyx:442(search) 652884 3.056 0.000 3.056 0.000 re2.pyx:114(__init__) 652953 2.145 0.000 2.145 0.000 {isinstance} 652884 2.002 0.000 2.002 0.000 re2.pyx:123(__dealloc__) 652953 1.911 0.000 1.911 0.000 re2.pyx:75(unicode_to_bytestring) 652953 1.902 0.000 1.902 0.000 re2.pyx:86(pystring_to_bytestring) 1 0.330 0.330 42.492 42.492 export_plain.py:98(export_fname) 9479 0.173 0.000 0.173 0.000 {built-in method sub} 10000 0.120 0.000 0.120 0.000 {method 'split' of 'str' objects} 8967 0.063 0.000 0.099 0.000 re2.pyx:801(get) 10069 0.061 0.000 0.061 0.000 {method 'strip' of 'str' objects} 69 0.043 0.001 0.146 0.002 re2.pyx:806(prepare_pattern) 9036 0.038 0.000 0.038 0.000 re2.pyx:788(__next) 69 0.022 0.000 0.169 0.002 re2.pyx:905(_compile) 1 0.005 0.005 0.177 0.177 export_plain.py:36(load) 69 0.002 0.000 0.003 0.000 re2.pyx:784(__init__) 69 0.001 0.000 0.170 0.002 re2.pyx:763(compile) 38 0.001 0.000 0.001 0.000 {method 'write' of 'file' objects} 69 0.001 0.000 0.171 0.002 {re2.compile} 1 0.001 0.001 42.669 42.669 export_plain.py:160(main) 3 0.000 0.000 0.000 0.000 {open} 69 0.000 0.000 0.000 0.000 {method 'append' of 'list' objects} 19 0.000 0.000 0.000 0.000 {method 'join' of 'str' objects} 1 0.000 0.000 0.000 0.000 genericpath.py:38(isdir) 1 0.000 0.000 42.669 42.669 export_plain.py:153(run_re2_test) 1 0.000 0.000 0.000 0.000 {posix.stat} 4 0.000 0.000 0.000 0.000 {time.time} 1 0.000 0.000 0.000 0.000 posixpath.py:59(join) 1 0.000 0.000 42.670 42.670:1( ) 1 0.000 0.000 0.000 0.000 {method 'encode' of 'unicode' objects} 3 0.000 0.000 0.000 0.000 {method 'rfind' of 'str' objects} 2 0.000 0.000 0.000 0.000 posixpath.py:109(basename) 1 0.000 0.000 0.000 0.000 posixpath.py:117(dirname) 1 0.000 0.000 0.000 0.000 stat.py:40(S_ISDIR) 2 0.000 0.000 0.000 0.000 {len} 1 0.000 0.000 0.000 0.000 {method 'extend' of 'list' objects} 1 0.000 0.000 0.000 0.000 {method 'startswith' of 'str' objects} 1 0.000 0.000 0.000 0.000 {method 'endswith' of 'str' objects} 1 0.000 0.000 0.000 0.000 stat.py:24(S_IFMT) 1 0.000 0.000 0.000 0.000 {method '__enter__' of 'file' objects} 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
It looks like that the _search function (re2.pyx:393) cost most of the time. I guess the actual Match function (re2.pyx:424) cost most of the time in this function.
result = self.re_pattern.Match(sp[0], <int>pos, <int>size, anchoring, m.matches, self.ngroups + 1)
but I don't how can be so different between this and the pure C version.
PS: Pyre2 revision : commit 543f228bc8 re2 revision : changeset: 79:0c439a6bd795
Played with re2/pyre2 recently to estimate its performance improvements over the standard re module, and found the noticeable slowdown instead:
In [1]: import re
In [2]: import re2
In [3]: r1 = re.compile(r'^(?:(?P<_db>databases/(?P<db_name>.+)/(?P<db_value>.+))|(?P<_col>collections/(?P<col_name>.+)/(?P<col_value>.+))|(?P<_u>users/(?P<u_name>.+)/(?P<u_value>.+))|(?P<_gr>groups/(?P<gr_name>.+)/(?P<gr_value>.+))|(?P<_comb>combinations/(?P<comb_name>.+)/(?P<comb_value>.+))|(?P<_ind>indexes/(?P<ind_name>.+)/(?P<ind_value>.+))|(?P<_inv>invoices/(?P<inv_name>.+)/(?P<inv_value>.+))|(?P<_pl>plans/(?P<pl_name>.+)/(?P<pl_value>.+))|(?P<_srv>services/(?P<srv_name>.+)/(?P<srv_value>.+))|(?P<_req>requests/(?P<req_name>.+)/(?P<req_value>.+)))$')
In [4]: r2 = re2.compile(r'^(?:(?P<_db>databases/(?P<db_name>.+)/(?P<db_value>.+))|(?P<_col>collections/(?P<col_name>.+)/(?P<col_value>.+))|(?P<_u>users/(?P<u_name>.+)/(?P<u_value>.+))|(?P<_gr>groups/(?P<gr_name>.+)/(?P<gr_value>.+))|(?P<_comb>combinations/(?P<comb_name>.+)/(?P<comb_value>.+))|(?P<_ind>indexes/(?P<ind_name>.+)/(?P<ind_value>.+))|(?P<_inv>invoices/(?P<inv_name>.+)/(?P<inv_value>.+))|(?P<_pl>plans/(?P<pl_name>.+)/(?P<pl_value>.+))|(?P<_srv>services/(?P<srv_name>.+)/(?P<srv_value>.+))|(?P<_req>requests/(?P<req_name>.+)/(?P<req_value>.+)))$')
In [5]: %timeit r1.match('invoices/qCMSrSo5aBcQ6rqS3a6J1VQvRx9PRAec/BVGqqoiA92ws2BqaAZNh0YagTpRUUvyZ') 100000 loops, best of 3: 6.57 us per loop
In [6]: %timeit r2.match('invoices/qCMSrSo5aBcQ6rqS3a6J1VQvRx9PRAec/BVGqqoiA92ws2BqaAZNh0YagTpRUUvyZ') 1000 loops, best of 3: 272 us per loop
pyre2 is as of 543f228bc8a8b2f4ad535015c3915d0c84f09bd9 (Sun Feb 5 10:49:45 2012 -0500), re2 is as of 7bf5e87ef4e8 (Sun Oct 21 10:48:11 2012 -0400)