python-hyperscan

Memory leak in Database object when compiling, dumping and loading.

BartVerc opened this issue 2 years ago

Thanks for the great library. We discovered a memory leak when a Database object is compiled, dumped, and loaded multiple times. The leak gets worse as the database grows larger. Even creating a new Database object every time still results in a leak.

Compiling one pattern in the same Database object:

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])

(memory usage plot: case1)

Creating a new Database object for every compile slows the leak down.

import hyperscan as hs

for i in range(100000):
    db = hs.Database(mode=hs.HS_MODE_BLOCK)
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])

(memory usage plot: case2)

But when the Database object is also dumped (and loaded), the memory usage grows even faster.

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
    b = hs.dumpb(db)

(memory usage plot: case3)

I tried to dig into the C code to find the problem, but I do not have enough expertise to solve the issue. Can someone help or point me in the right direction? Thanks!

BartVerc, Sep 07 '22

I am experiencing the same issue, but unfortunately I am also not a C wizard.

Ivaylo-Korakov, Sep 08 '22

this should be fixed in the latest release, but please confirm when you get a chance before I close this ticket.

apologies for the delay

darvid, Dec 14 '22

Hi @darvid, I was just testing this with the latest version available from pypi, which seems to be v0.3.2, so I wanted to let you know v0.3.3 is not published on pypi yet :)

BartVerc, Jan 06 '23

Hi @darvid, thanks for the memory leak fix. It resolves our first case, where we only compile the database in the loop X times.

However, when we also dump the database in the loop, we see that after some iterations the memory starts growing.

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
    b = hs.dumpb(db)

The memory also starts growing if you repeatedly load the bytes into a database, reusing the same variable each time.

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)

db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)

for i in range(100000):
    db = hs.loadb(b)

As we reuse the same variable, I would expect Python to clean up the memory. I can send you the plots if you're interested. Thanks again for looking into it!
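
If you want to reproduce the growth without a full profiler, something along these lines should show the trend (just a rough sketch using the standard library's resource module; ru_maxrss reports peak RSS, so it captures the upward growth but not any releases):

import resource

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)

for i in range(100000):
    db = hs.loadb(b)
    if i % 10000 == 0:
        # peak resident set size so far (kilobytes on Linux)
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)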

BartVerc, Jan 10 '23

ahhhh, looks like the leak is now with scratch not being freed. here's a quick workaround while I push and release a fix:

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)

db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)

for i in range(100000):
    del db.scratch  # dealloc scratch anytime you decref db
    db = hs.loadb(b)

darvid, Jan 10 '23

You are a wizard @darvid :smile:

Ivaylo-Korakov, Jan 10 '23

Thank you very much!

BartVerc, Jan 10 '23

@darvid, we missed a segmentation fault with the code you suggested; it was hidden by the memory profiler. Investigating further...

BartVerc, Jan 10 '23

yeah, looks like the source of the issue is the conditional scratch creation (create_scratch) in loadb. The actual workaround is calling loadb like this:

hs.loadb(b, create_scratch=False)

I think I'll just remove that parameter, because calling compile will implicitly create a scratch if it's set to None anyways.
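
for context, the reload loop from earlier would look roughly like this with that workaround applied (untested sketch, just combining the snippets above):

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)

for i in range(100000):
    # skip allocating a fresh scratch on every load; a later compile will
    # implicitly create one while db.scratch is None
    db = hs.loadb(b, create_scratch=False)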

darvid, Jan 10 '23

let's keep this issue open until i push a release

darvid, Jan 10 '23

It's leaking less memory with hs.loadb(b, create_scratch=False), around half of it, but it's still leaking. I will keep digging into it.

BartVerc, Jan 10 '23

hey, this should be fixed in the commit above, but i'm running into some annoying CI issues that have been taking a while to fix and are preventing a release to pypi (i'd rather fix the release pipeline than force publish by hand)

until then you can try building from master; there's also a dockerfile that should be up-to-date and help with compiling hyperscan. hopefully i'll have the build issues fixed asap

darvid, Jan 11 '23

This solves the issues, thanks a lot!

BartVerc, Jan 11 '23

0.4.0 now on pypi, closing for now, hopefully no more memory leaks 😅

darvid, Jan 11 '23

also really sorry for the inconvenience if you cloned the repo locally, i did a bunch of history rewriting and commit amending to reduce noise in the commit log while debugging the build process, so you may need to do a fresh clone next time.

darvid, Jan 11 '23