pyahocorasick

How to remove matches that do not align with word boundaries?

Open zwd2080 opened this issue 2 years ago • 3 comments

The second match (5, 'her') and the last one (14, 'she') do not align with word boundaries. How can I remove them? Or could we force the automaton to match whole words only?

import ahocorasick

A = ahocorasick.Automaton()
for key in 'he her hers she'.split():
    A.add_word(key, key)  # store each key as its own value
A.make_automaton()
needle = "he here her shes"
list(A.iter_long(needle))
# [(1, 'he'), (5, 'her'), (10, 'her'), (14, 'she')]

zwd2080, Jun 22 '22 17:06

Are you saying that you only want whole words to be matched? If so, you do not want to add the strings as sequences of characters, but rather add each key as a sequence of words converted to numbers. Otherwise the automaton is built on characters and will match characters: it does not know anything about words.

pombredanne, Jan 14 '23 12:01

Hi @pombredanne just to make sure I understand: the idea is that each unique word in the needles would map to a distinct int and we'd add these ints as keys and the words as the values?

Do you have a recommendation for this mapping? The haystack will also need to be mapped with the same resulting map before iterating over it.

Thanks!

donatoaz, Jan 17 '23 13:01
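
One way the mapping asked about above could look, as a minimal sketch rather than a maintainer recommendation. It assumes whitespace tokenization, and the names `build_vocab`, `encode`, and `UNKNOWN_ID` are hypothetical helpers, not part of pyahocorasick. The needles and the haystack go through the same vocabulary, and haystack words that appear in no needle share one reserved ID so they can never match. The resulting integer tuples are meant for an automaton created with `key_type=ahocorasick.KEY_SEQUENCE`, which per the pyahocorasick documentation takes tuples of integers as keys; that step is not shown here.

```python
from typing import Dict, Iterable, Tuple

UNKNOWN_ID = 0  # hypothetical reserved ID for words that appear in no needle


def build_vocab(phrases: Iterable[str]) -> Dict[str, int]:
    """Assign a distinct positive integer to every unique word in the needles."""
    vocab: Dict[str, int] = {}
    for phrase in phrases:
        for word in phrase.split():  # assumption: words are whitespace-separated
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(text: str, vocab: Dict[str, int]) -> Tuple[int, ...]:
    """Convert text to a tuple of word IDs using the shared vocabulary."""
    return tuple(vocab.get(word, UNKNOWN_ID) for word in text.split())


needles = ["he", "her", "hers", "she"]
vocab = build_vocab(needles)

# Encoded needle -> original phrase; these tuples would be the automaton keys.
keys = {encode(n, vocab): n for n in needles}

# The haystack must be encoded with the same vocabulary before searching.
haystack = encode("he here her shes", vocab)
print(haystack)  # (1, 0, 2, 0)
```

With this encoding, "here" and "shes" become `UNKNOWN_ID`, so only the whole words "he" and "her" remain matchable. Match positions reported on such a sequence would be word indices, not character offsets.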

@pombredanne

Can we get more info on this, please? I want exact (whole) word matches and I do not understand how to approach it. Any insights would be greatly appreciated.

Thanks

explrA, Oct 02 '23 00:10
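
A different approach from the word-to-number encoding suggested earlier in the thread is to keep the character-level automaton and filter its results afterwards. The sketch below is only an illustration under stated assumptions: `whole_word_matches` and `is_word_char` are hypothetical helpers, not pyahocorasick functions, and each stored value is assumed to be the key itself (as in the original snippet), so its length gives the match length. The matches are copied from the output in the original post, which keeps the example independent of the library.

```python
from typing import Iterable, List, Tuple


def is_word_char(ch: str) -> bool:
    """Assumption: a word consists of alphanumeric characters and underscores."""
    return ch.isalnum() or ch == "_"


def whole_word_matches(
    haystack: str, matches: Iterable[Tuple[int, str]]
) -> List[Tuple[int, str]]:
    """Keep only (end_index, key) matches that start and end on a word boundary."""
    kept = []
    for end, key in matches:
        start = end - len(key) + 1  # end index is inclusive
        left_ok = start == 0 or not is_word_char(haystack[start - 1])
        right_ok = end == len(haystack) - 1 or not is_word_char(haystack[end + 1])
        if left_ok and right_ok:
            kept.append((end, key))
    return kept


needle = "he here her shes"
# Output of list(A.iter_long(needle)) as reported in the original post.
matches = [(1, "he"), (5, "her"), (10, "her"), (14, "she")]
print(whole_word_matches(needle, matches))  # [(1, 'he'), (10, 'her')]
```

One caveat: a longest match that is dropped for crossing a boundary may have hidden a shorter match that does sit on word boundaries, so filtering the results of `A.iter` instead of `A.iter_long` may be the safer choice when keys can contain spaces.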