pyahocorasick how about the new approach base on 'Double Array Trie'

how about the new approach base on 'Double Array Trie'

Open SeekPoint opened this issue 3 years ago • 15 comments

Aho Corasick algorithm based on Double Array Trie https://github.com/hankcs/AhoCorasickDoubleArrayTrie

Jan 03 '21 14:01 SeekPoint

Looks interesting, thanks for the pointer!

Jan 03 '21 18:01 WojciechMula

here, the paper.. https://dl.acm.org/doi/10.1109/32.31365

waiting good news

Jan 04 '21 02:01 SeekPoint

In addition to the performance enhancements, this might also be a good candidate for a 'shared memory data structure' discussed here https://github.com/WojciechMula/pyahocorasick/issues/114

Jan 04 '21 03:01 Dobatymo

FWIW, @mischasan has had this other approach using interleaved arrays (for a long while) https://github.com/mischasan/aho-corasick https://docs.google.com/document/d/1e9Qbn22__togYgQ7PNyCz3YzIIVPKvrf8PCrFa74IFM/edit and is in C++. @WojciechMula take in this project is fairly tight and compact... IMHO any fundamental change to the core would need super careful testing and benchmarking as this is doubtful that they would perform any better until proven ;)

Jan 31 '21 09:01 pombredanne

That @mischasan project seems great!

I wish I could put more effort in this project, there are some obstacles I'm unable to overcome. Like, a day has 24 hours. :) I quite often think about improving the module, about trying different data structures, extending API etc. But this would require to allocate a lot of free time if it has to be done well.

Jan 31 '21 17:01 WojciechMula

Sympathy for your days of only 24 hours :-) I'm living the life of a single working mum, with two covid-vulnerable dependents. More power to you for contributing open source.

An interleaved array, and the breadth-first insertion of tree nodes, was a data-oriented design choice. Having one flat mem block shareable by multiple processes paid off.

What hurdles to that does managed memory (python) present? Or not relevant?

Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about?

The only work I planned for acism was restoring correct backlink-trimming.

Anyway, all the best.

On 31/01/2021, Wojciech Muła [email protected] wrote:

That @mischasan project seems great!

I wish I could put more effort in this project, there are some obstacles I'm unable to overcome. Like, a day has 24 hours. :) I quite often think about improving the module, about trying different data structures, extending API etc. But this would require to allocate a lot of free time if it has to be done well.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/WojciechMula/pyahocorasick/issues/137#issuecomment-770418866

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

Jan 31 '21 18:01 mischasan

You seem to refer to a prior conversation; if it's not private, could you point me at that?

On 31/01/2021, Mischa Sandberg [email protected] wrote:

Sympathy for your days of only 24 hours :-) I'm living the life of a single working mum, with two covid-vulnerable dependents. More power to you for contributing open source.

An interleaved array, and the breadth-first insertion of tree nodes, was a data-oriented design choice. Having one flat mem block shareable by multiple processes paid off.

What hurdles to that does managed memory (python) present? Or not relevant?

Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about?

The only work I planned for acism was restoring correct backlink-trimming.

Anyway, all the best.

On 31/01/2021, Wojciech Muła [email protected] wrote:

That @mischasan project seems great!

I wish I could put more effort in this project, there are some obstacles I'm unable to overcome. Like, a day has 24 hours. :) I quite often think about improving the module, about trying different data structures, extending API etc. But this would require to allocate a lot of free time if it has to be done well.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/WojciechMula/pyahocorasick/issues/137#issuecomment-770418866

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

Jan 31 '21 18:01 mischasan

Could you point me at the perf stats for pyaho? And at anything you thought was non-obvious smart about it?

rspamd (not my project) uses million-string match sets; acism fits them in <10MB -- necessary for a router-level process. Genome pattern matching (U of Queensland; can't remember the name) used similar huge sets. Did you have test case of those sizes?

On 31/01/2021, Mischa Sandberg [email protected] wrote:

Sympathy for your days of only 24 hours :-) I'm living the life of a single working mum, with two covid-vulnerable dependents. More power to you for contributing open source.

An interleaved array, and the breadth-first insertion of tree nodes, was a data-oriented design choice. Having one flat mem block shareable by multiple processes paid off.

What hurdles to that does managed memory (python) present? Or not relevant?

Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about?

The only work I planned for acism was restoring correct backlink-trimming.

Anyway, all the best.

On 31/01/2021, Wojciech Muła [email protected] wrote:

That @mischasan project seems great!

I wish I could put more effort in this project, there are some obstacles I'm unable to overcome. Like, a day has 24 hours. :) I quite often think about improving the module, about trying different data structures, extending API etc. But this would require to allocate a lot of free time if it has to be done well.

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/WojciechMula/pyahocorasick/issues/137#issuecomment-770418866

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

Jan 31 '21 19:01 mischasan

Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about?

@mischasan I am the culprit for "approximate" word addition in the README (I did not write the feature) and this is essentially because of the wildcard feature https://pyahocorasick.readthedocs.io/en/latest/#wildcard-search which I reckon may be not entirely correct as a qualifier

Jan 31 '21 19:01 pombredanne

Aucune problème.

(Mieux en anglais; en dépit d'être canadien :) The backlink pruning originally misimplemented (naïvement) in acism would apply to any aho-corasick tree. I'll think about that.

When it comes to (simple) wildcard matching, it's hard to beat flex/lex.

On 31/01/2021, Philippe Ombredanne [email protected] wrote:

Looked at your README: "... approximate (multi-pattern string search)" had me curious. spamd uses acism of exact chunks, to support regexes. What's your "approximate" about?

@mischasan I am the culprit for "approximate" addition and this is essentially because of the wildcard feature https://pyahocorasick.readthedocs.io/en/latest/#wildcard-search which I reckon may be not entirely correct as a qualifier

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/WojciechMula/pyahocorasick/issues/137#issuecomment-770437734

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

Jan 31 '21 19:01 mischasan

@mischasan I do not have hard stats, but I use pyahocorasick with a dataset of about ~250MB for about 21K patterns of very different sizes: some a few bytes and some 10's of KB. This is in https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data The main automaton has these stats:

$ python
Python 3.6.10 (default, Jun 13 2020, 08:53:46) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from licensedcode.cache import *
>>> get_index().rules_automaton.get_stats()
{'nodes_count': 3156589, 'words_count': 21370, 'longest_word': 14062, 'links_count': 3156588, 'sizeof_node': 32, 'total_size': 126263552}

... knowing that these are based on strings converted converted sequences of 16bit ints using a dictionary mapping known words to 16 bits ints, and this prior to being added to the Trie, so in my case this is likely more compact than if using the original bytes directly. 126MB is roughly the size of that automaton. I did not run explicit benchmarks on that, but it is anecdotally orders of magnitude faster than anything else I had tried before and since and that has a Python API.

Jan 31 '21 20:01 pombredanne

@mischasan I guess that a long while ago I was considering building a Python wrapper for your C++ acism based on https://github.com/mischasan/aho-corasick/issues/12#issuecomment-167447500 ... but then I found Wojciech's pyahocorasick . Now it's all coming back to me :)

Jan 31 '21 20:01 pombredanne

Just to clarify: the input pattern strings are UTF-16; or just the internally-converted dictionary keys? And the dictionary mapping can handle up to 65535 strings?

On 31/01/2021, Philippe Ombredanne [email protected] wrote:

@mischasan I guess that a long while ago I was considering building a Python wrapper for your C++ acism based on https://github.com/mischasan/aho-corasick/issues/12#issuecomment-167447500 ... but then I found Wojciech's pyahocorasick . Now it's all coming back to me :)

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/WojciechMula/pyahocorasick/issues/137#issuecomment-770442572

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

Jan 31 '21 21:01 mischasan

And looking back at the acism version you talked about (licensing), wow, five years just flew by. I've done just way too much proprietary work, related to db engines.

On 31/01/2021, Mischa Sandberg [email protected] wrote:

Just to clarify: the input pattern strings are UTF-16; or just the internally-converted dictionary keys? And the dictionary mapping can handle up to 65535 strings?

On 31/01/2021, Philippe Ombredanne [email protected] wrote:

@mischasan I guess that a long while ago I was considering building a Python wrapper for your C++ acism based on https://github.com/mischasan/aho-corasick/issues/12#issuecomment-167447500 ... but then I found Wojciech's pyahocorasick . Now it's all coming back to me :)

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/WojciechMula/pyahocorasick/issues/137#issuecomment-770442572

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

-- Engineers think equations approximate reality. Physicists think reality approximates the equations. Mathematicians never make the connection.

Jan 31 '21 21:01 mischasan

Just to clarify: the input pattern strings are UTF-16; or just the internally-converted dictionary keys? And the dictionary mapping can handle up to 65535 strings?

@mischasan in my case the inputs are long open source license texts, medium notices and short mentions. They are broken into tokens (e.g. split in words where anything space or punctuation is a separator). Each word is looked up in a dictionary keyed by these words, and where the value is a 16 bits int. The sequence of all 16 bits values for a given input string (e.g. a license text) becomes the bytes fed to the automaton for indexing.

At search time, the input text to search patterns for (e.g. some text where we want to find a license) is subjected to the same transformation in a sequence of 2 bytes tokens then used for an Aho Corasick search.

That's a highly specialized use case of course.... the dictionary is basically mapping words known to be used in any license and relevant more or less "legally".... Think of these as the whole vocabulary of a language that I call "legalese". Once the dictionary transformation is done, words like "license" or "copyright" may be represented by xAB or x78 and are all concatenated back in a bytes string. After that step we have something plain from an ahocorasick point of view except that this is a 2bytes "alphabet" that's not ASCII and not Unicode or UTF, just an encoded bytes string.

And yes in my case the dictionary has up to 65535 known words.... But guess what legalese vocabulary is much poorer than that ;) ... ATM is it about 20000 words long and growing fairly slowly when new licenses and notices are added.

Jan 31 '21 22:01 pombredanne

pyahocorasick pyahocorasick copied to clipboard

how about the new approach base on 'Double Array Trie'

pyahocorasick
pyahocorasick copied to clipboard