
Return found key in Automaton.iter()

Pehat opened this issue 6 years ago • 10 comments

Hello! Thank you for the great library. I need to find, in an input string, multiple keys stored in the automaton. The number of keys is big and they overlap significantly, so the automaton greatly reduces memory consumption. However, I need to get all (position, key) entries from the string. The documentation suggests storing the key in the associated value, but this approach nullifies all the benefits of storing the keys in the automaton. Is there any way to get the found item's key in Automaton().iter()?

Pehat avatar Jun 19 '18 17:06 Pehat

The Aho-Corasick automaton is a simple DFA, so while visiting nodes we would need to save somewhere the whole path that led to the current, accepting state. At first glance it seems pretty complicated, but I haven't had time to check the idea in detail. Maybe you, or somebody else, can point to a paper which describes such an application?

WojciechMula avatar Jun 21 '18 17:06 WojciechMula

If you use STORE_LENGTH you can combine the end position and length to get the portion of the haystack to slice.

EDIT: Although a convenience method that does this might be a nice addition.

frankier avatar Sep 18 '18 18:09 frankier

> The Aho-Corasick automaton is a simple DFA, so while visiting nodes we would need to save somewhere the whole path that led to the current, accepting state. At first glance it seems pretty complicated, but I haven't had time to check the idea in detail. Maybe you, or somebody else, can point to a paper which describes such an application?

I have briefly read the Wikipedia article on Aho-Corasick. If I understand it right, your DFA looks something like this: [Aho-Corasick DFA diagram]

This looks like a special case of a DFA: every node has exactly one parent. Therefore, if the DFA is in an accepting state, we can trace the whole path back up to the root and assemble the key (in reversed order) by simply following parent links. Please tell me if I am making a mistake here.

Pehat avatar Sep 18 '18 21:09 Pehat

Hi Pehat,

Why can't you trace back in the string being searched, like I suggested?

frankier avatar Sep 19 '18 06:09 frankier

> Hi Pehat,
>
> Why can't you trace back in the string being searched, like I suggested?

Hi frankier,

I need to store associated Python objects in the trie, so I can't use this option.

Pehat avatar Sep 20 '18 18:09 Pehat

How about adding a tuple (length, current_value) to the trie? It should use less space than storing the input string, too. Does that solve your space problem?

frankier avatar Sep 20 '18 19:09 frankier

@Pehat I'm not sure a simple backtrace would do the job, because an accepting node might be reached via different paths. This is why I suggested that the matched string must be built during the DFA walk.

@frankier's solution seems OK.

WojciechMula avatar Oct 01 '18 16:10 WojciechMula

I was thinking about this and to my knowledge it's simply not possible without extra memory overhead.

When you are at a given node going down the trie, you simply append letters to the word. But if you need to go back via a fail link, you must remove the k trailing letters from the already collected string. The value k is unknown; it would have to be saved alongside the fail link during automaton construction.

Another option (IMO easier and more reliable) would be to allow traversing from any node back to the root node. But this requires storing a pointer from each node to its parent.
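A pure-Python sketch of this parent-pointer idea (not pyahocorasick's actual C internals): each trie node keeps a link to its parent and the transition character, so an accepting node can rebuild its key by walking back to the root.

```python
class Node:
    __slots__ = ("char", "parent", "children", "accepting")

    def __init__(self, char=None, parent=None):
        self.char = char          # letter on the edge from the parent
        self.parent = parent      # None only for the root
        self.children = {}
        self.accepting = False


def add_word(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node(ch, node))
    node.accepting = True
    return node


def key_of(node):
    # Walk parent links up to the root; the characters come out
    # reversed, so flip them at the end.
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


root = Node()
leaf = add_word(root, "hers")
print(key_of(leaf))  # reconstructs "hers" from the node alone
```

The extra pointer per node is exactly the memory cost weighed in the next comment.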

WojciechMula avatar Dec 11 '18 19:12 WojciechMula

So it seems to be doable at the C level. But I don't want to increase memory consumption, as the structure is already big; I'd rather shrink the nodes.

An option would be to enable this feature with a pre-processor definition, but then it must somehow be resolved when installing via pip (I don't know how; any ideas?)

WojciechMula avatar Dec 11 '18 19:12 WojciechMula

I'm not sure whether it is possible to use less memory when using C, but I could also imagine that in CPython we could do:

a.add_word("some_word", ("some_word", "some_normalized_value"))

Maybe the lib could take care of doing this implicitly, returning (end_index, key, value) behind the scenes, or even (start, end, key, value).

Actually, I find myself adding the length using:

word = "word"
automaton.add_word(word, (len(word), "value"))

So that at match time, when I get (end_index, (length, value)), I can compute:

end_index - length + 1

to recover the starting position and extract the match.

All the postprocessing I do makes using this library 2x slower for me (and I'm running CPython as optimized as I can make it), trading speed for usability.

I wish I knew how to write C so I could help :(

kootenpv avatar Jan 06 '19 23:01 kootenpv

I think @frankier's approach is the simplest. I use it too, in https://github.com/nexB/scancode-toolkit/blob/0aa964e00d9655f27049e958813d1789edbcf13d/src/licensedcode/match_aho.py#L64

@Pehat you wrote:

> The documentation suggests to store keys in the associated value, but this approach nullifies all the benefits from storing all the keys in automaton.

then you wrote:

> I need to store associated Python objects in the trie, so I can't use this option.

These two statements seem to conflict. IMHO, since you are already storing a lot, just store the extra indexes too and reconstruct the original strings that way. Use a tuple, or a list of tuples, or whatever fits: store your objects together with the pointers you need to get back the keys.

I am closing this now, but please reopen if you feel strongly about it!

pombredanne avatar Jan 14 '23 15:01 pombredanne