pyahocorasick Library usage questions...

Hi,

I'm just starting to discover and use your library to extract terms defined in a thesaurus from an input text, and "highlight" them in a HTML output. It works quite well and quickly, anyway I still have a few issues:

is there any way to get only the first occurence of a given term when this term is present several times in the input text?
I have some use cases where a term can "contain" another one (for example, "Python", "IDE Python" and "Python language reference"). Actually, in this example, the automaton returns the term "Python" event when it is included into another matching term. Is there any way to select only one of them (for example the first matching term, or the longest one)?

Many thanks for any advise!

Best regards, Thierry

Dec 19 '18 08:12 tflorac

Hi Thierry, glad to hear you found pyahocorasick module useful.

There's no way to limit number of hits, you have to do this in your python code. But it seems to be an interesting feature. Would you open a feature request for this?

Automaton will always return all occurrences of strings from the set, even when they overlap. I think the approach from #21 is what you need, but it's not ready. Seems I'll need to back to this, you're another person who requests this.

Dec 19 '18 16:12 WojciechMula

Thanks for your reply! I will open another ticket for a feature request!! And if you need help for testing, just ask! ;)

Dec 19 '18 20:12 tflorac

@tflorac thank you, I always need help, as my day usually has 24h only :)

Dec 19 '18 20:12 WojciechMula

I'll complete my initial question with another one... I found a solution to only get the first found occurrence of a string, and I'm now looking for a solution to handle my second problem: how to handle use case where a term overlaps another one! It seems that when overlap occurs at the end of the longest string (for example, with the terms "Python" and "Advanced Python"), the longest one (here "Advanced Python") appears first, but with the same position as the second one ; so I think that I can easilly handle this case by remembering the last position at which I already found a term. Can you confirm this point? The problem persists when two terms overlap at the start ("Python" and "Python Book" for example); is there any rule in this case which defines the order in which found terms will be returned?

Dec 20 '18 10:12 tflorac

@tflorac I'm afraid there is no such rule. The automaton work this way that if it matches the last character of some word, it also can figure out other words with the same suffix. But there's no notion of "the first character of match".

BTW, what should be highlighted if you have set ["Advanced Python", "Python Book"] and the string is "Advanced Python Book"? There is partial overlap.

Dec 20 '18 19:12 WojciechMula

@WojciechMula: well... so at first I'm going to follow a simple process where if two terms (or more) match at the same position, the first one will win and will be selected!

I don't have any rule actually to answer to your question in the use case you give. I can't define any priority between the three terms, except by giving a "weight" to each term (statically or, for example, based on their respective length)... :-/

Dec 21 '18 07:12 tflorac

I also came here to request exactly this... I think the functionality "keep the longest match" will be very well received :)

@WojciechMula Oh and thanks for being so quick on responding to people's requests, I'm having a great time. I'm building a library on top of this: I pretty much finished it but only now realized the overlap "issue" :(

In my case I want to add 2 strings and count on the fact that only the longest one will be kept.

Dec 26 '18 20:12 kootenpv

I think this works as python logic, but it would be awesome if we could get the speed from c as I think this will be really slow:

from ahocorasick import Automaton

a = Automaton()

a.add_word("no", "no")
a.add_word("no thanks", "no thanks")
a.add_word("thanks", "thanks")

a.make_automaton()

ranges = [(x[0] - len(x[1]) + 1, x[0] + 1, x[1]) for x in a.iter("no thanks")]

# [(0, 2, 'no'), (0, 9, 'no thanks'), (3, 9, 'thanks')]

def remove_overlap(ranges):
    result = []
    current_stop = -1 
    for start, stop, value in ranges:
        if start > current_stop:
            # this segment starts after the last segment stops
            # just add a new segment
            result.append((stop - start, value))
            current_stop = stop
        elif stop - start > result[-1][0]:
            # segments overlap, replace
            result[-1] = (stop - start, value)
            # current_start already guaranteed to be lower
            current_stop = max(current_stop, stop)

    return [x[1] for x in result]

remove_overlap(ranges)
['no thanks']

In my case it seems that incorporating this code yields a ~15% slowdown.

Dec 26 '18 21:12 kootenpv

@kootenpv Thank you for this example! I think this request depends on #82.

Jan 06 '19 13:01 WojciechMula

pyahocorasick pyahocorasick copied to clipboard

Library usage questions...

pyahocorasick
pyahocorasick copied to clipboard