
Memory consumption/efficiency

Salfiii opened this issue on Sep 20, 2019 · 2 comments

Hi,

I'm loading 4.6 million keywords plus their replacements into flashtext.

The raw data in a pandas dataframe consumes approx. 1 GB of RAM (profiled with pd.DataFrame.memory_usage(True, True) and guppy).

When I load this data into flashtext, the algorithm consumes 70 GB of RAM.

My question is whether this is intentional, and why the algorithm uses that much RAM for such a small dataset. Did I do something wrong, or is this correct?
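
For reference, here is a minimal sketch of the kind of loading step I mean (the column names and the pympler measurement are placeholders, not my actual code):

```python
# Minimal sketch of loading keyword/replacement pairs from a DataFrame into
# flashtext. Column names "keyword" and "replacement" are placeholders.
import pandas as pd
from flashtext import KeywordProcessor

df = pd.DataFrame({
    "keyword": ["New York", "Los Angeles"],   # ~4.6 million rows in reality
    "replacement": ["NY", "LA"],
})

kp = KeywordProcessor()
for keyword, replacement in zip(df["keyword"], df["replacement"]):
    kp.add_keyword(keyword, replacement)

# One way to measure the in-process size of the loaded processor
# (requires the optional pympler package):
# from pympler import asizeof
# print(asizeof.asizeof(kp) / 1e9, "GB")
```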

I'm also curious whether there is a "cloud native" way to use this algorithm (having the algorithm's data swapped out to a fast in-memory database), so that it becomes stateless and Kubernetes pods can be transferred/restarted without a long waiting period or a huge memory footprint.

Thanks in advance.

Best regards

Florian

Salfiii avatar Sep 20 '19 09:09 Salfiii

In flashtext, all keywords are stored in a trie (dictionary tree) structure, so the huge memory consumption is expected. For a trie over the 26-letter alphabet, memory usage can easily be many times the size of the raw keywords.
Using shorter keywords might help reduce the memory cost.
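
As a rough illustration (not flashtext's exact internals): in a pure-Python trie each node is typically a dict, and dicts carry a sizeable fixed overhead in CPython, so millions of multi-character keywords can expand to many gigabytes:

```python
import sys

# Each trie node is typically a dict keyed by single characters. Even an
# empty dict costs roughly 64 bytes on a recent 64-bit CPython (the exact
# number varies by version), plus ~50 bytes per one-character key string.
node = {}
print(sys.getsizeof(node))   # per-node overhead, before any children
print(sys.getsizeof("a"))    # cost of a single-character key
```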

I am also interested in the "cloud native" way you described. Making flashtext a distributed program is an extension I wish it had.

wangpeipei90 avatar Oct 16 '19 20:10 wangpeipei90

Hi there, just a few comments, @wangpeipei90:

In flashtext all keywords are stored in a trie dictionary tree structure, so the huge memory consumption is expected

I don't think so. flashtext does use a trie data structure, but a trie is a prefix structure, so prefixes are shared between different keywords. That should actually reduce memory consumption (in the general case at least), not increase it.
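
As a rough sketch of the idea (a hand-rolled nested-dict trie, not flashtext's actual internals), two keywords that share a prefix reuse the same nodes:

```python
# Illustrative nested-dict trie: shared prefixes are stored only once.
def add(trie, word, clean_name, end="_end_"):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})   # reuses existing nodes for shared prefixes
    node[end] = clean_name

trie = {}
add(trie, "New York", "NY")
add(trie, "New Orleans", "NOLA")   # the "New " nodes are shared with "New York"

print(trie["N"]["e"]["w"][" "].keys())   # dict_keys(['Y', 'O']): one prefix, two branches
```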

Concerning the "cloud native" part, the only experiment i had was passing KeywordProcessors to dask bag, something like

>>> from flashtext import KeywordProcessor
>>> import dask.bag as db

>>> b = db.from_sequence(['New York', 'Los Angeles'])
>>> kp = KeywordProcessor()
>>> kp.add_keywords_from_dict({'NY' : ['New York'], 'LA': ['Los Angeles']})
>>> b.map(kp.replace_keywords).compute()
['NY', 'LA']

dask should make deep copies of the keyword processor and communicate the copies to each worker.
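
Another (untested) option, closer to the in-memory-database idea: build the KeywordProcessor once, pickle it, and keep the blob in an external store such as Redis, so a freshly restarted pod can load it instead of rebuilding the trie. This is only a sketch; it assumes KeywordProcessor pickles cleanly, and the key name is arbitrary:

```python
# Sketch: persist a prebuilt KeywordProcessor in Redis so restarted pods can
# load it quickly. Assumes KeywordProcessor is picklable; "keyword_processor"
# is an arbitrary key name.
import pickle
import redis
from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keywords_from_dict({'NY': ['New York'], 'LA': ['Los Angeles']})

r = redis.Redis(host="localhost", port=6379)
r.set("keyword_processor", pickle.dumps(kp))

# ... later, in a freshly started pod ...
kp2 = pickle.loads(r.get("keyword_processor"))
print(kp2.replace_keywords("I love New York"))   # "I love NY"
```

Note that this only shortens startup time; each pod still deserializes the full trie, so the per-pod memory footprint stays the same.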

remiadon avatar Jul 29 '20 13:07 remiadon