python-hyperscan
Memory leak in Database object when compiling, dumping and loading.
Thanks for the great library. We discovered a memory leak when a Database object is compiled, dumped, and loaded multiple times. The leak gets worse as the database grows larger. Even creating a new Database object every time results in a leak.
Compiling one pattern in the same Database object:
import hyperscan as hs
db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
Creating a new Database object for every compile slows the leak down.
import hyperscan as hs
for i in range(100000):
    db = hs.Database(mode=hs.HS_MODE_BLOCK)
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
But when the Database object is also dumped (and loaded as well), memory usage grows even faster.
import hyperscan as hs
db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
    b = hs.dumpb(db)
I tried to dig into the C code to find the problem, but I do not have enough expertise to solve the issue. Can someone help or point me in the right direction? Thanks!
I am experiencing the same issue but am unfortunately also not a C wizard.
this should be fixed in the latest release, but please confirm when you get a chance before I close this ticket.
apologies for the delay
Hi @darvid, I was just testing this with the latest version available from pypi, which seems to be v0.3.2. So I wanted to let you know that v0.3.3 is not published on pypi yet :)
Hi @darvid, thanks for the memory leak fix. It resolves our first case, where we only compile the database in the loop X times.
However, when we are also dumping the database in the loop we see that after some iterations the memory starts growing.
import hyperscan as hs
db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
    b = hs.dumpb(db)
Memory also starts growing if you load the bytes back into a database, assigning to the same variable as before.
import hyperscan as hs
db = hs.Database(mode=hs.HS_MODE_BLOCK)
db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)
for i in range(100000):
    db = hs.loadb(b)
As we reuse the same variable, I expect Python to clean up the memory. I can send you the plots if you're interested. Thanks again for looking into it!
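For reference, here is a minimal sketch of how the growth can be observed without a full profiler (this just samples peak RSS with the standard resource module; our actual plots came from a memory profiler):

import resource

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)
for i in range(100000):
    db = hs.loadb(b)
    if i % 10000 == 0:
        # peak resident set size so far (kilobytes on Linux);
        # it keeps climbing across iterations instead of staying flat
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)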
ahhhh looks like the leak is with scratch not being freed now. here's a quick workaround while I push and release a fix:
import hyperscan as hs
db = hs.Database(mode=hs.HS_MODE_BLOCK)
db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)
for i in range(100000):
    del db.scratch  # dealloc scratch anytime you decref db
    db = hs.loadb(b)
You are a wizard @darvid :smile:
Thank you very much!
@darvid, we missed a segmentation fault with the code you suggested; it did not show up under the memory profiler. Investigating further...
yeah, looks like the source of the issue is the conditional scratch creation (create_scratch) in loadb. The actual workaround is calling loadb like this:
hs.loadb(b, create_scratch=False)
I think I'll just remove that parameter, because calling compile will implicitly create a scratch if it's set to None anyways.
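in the dump/load loop from your repro, that would look like this (just a sketch of the workaround until the parameter is removed):

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)
for i in range(100000):
    # don't allocate a fresh scratch on every load; compile() would
    # recreate one implicitly later if db.scratch is None
    db = hs.loadb(b, create_scratch=False)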
let's keep this issue open until i push a release
It's leaking less memory with hs.loadb(b, create_scratch=False), around half of it, but it's still leaking. I will keep digging into it.
hey, this should be fixed in the commit above, but i'm running into some annoying CI issues that have been taking a while to fix and are blocking the release to pypi (i'd rather fix the release pipeline than force publish by hand)
until then you can try building from master; there's also a dockerfile that should be up-to-date and help with compiling hyperscan. hopefully i'll have the build issues fixed asap
This solves the issues, thanks a lot!
0.4.0 now on pypi, closing for now, hopefully no more memory leaks 😅
also really sorry for the inconvenience if you cloned the repo locally, i did a bunch of history rewriting and commit amending to reduce noise in the commit log while debugging the build process, so you may need to do a fresh clone next time.