pyahocorasick
How can I resolve overlapping matches?
The output contains overlapping matches between different phrases. How can I resolve the overlaps? Is there any advice?
As shown in the example below, I want the output to contain:
- The longest phrase when matches overlap: 'Saint Petersburg Town' among the three results.
- The first one when the lengths are equal: 'Saint Petersburg' rather than 'Petersburg Town'.
import re

import ahocorasick


def preprocess(text):
    # Lowercase, replace every non-letter with '_', and pad both ends with '_'
    # so that keys only match on word boundaries.
    return '_{}_'.format(re.sub('[^a-z]', '_', text.lower()))


index = ahocorasick.Automaton()
for city in ['Petersburg Town', 'Saint Petersburg', 'Saint Petersburg Town']:
    # print(preprocess(city))
    index.add_word(preprocess(city), city)
index.make_automaton()


def find_cities(text, searcher):
    result = dict()
    for end_index, city_name in searcher.iter(preprocess(text)):
        # end_index points at the trailing '_' in the padded text;
        # subtracting 1 undoes the leading pad, giving an exclusive end in `text`.
        end = end_index - 1
        start = end - len(city_name)
        occurrence_text = text[start:end]
        result[(start, end)] = city_name
    return result


print(find_cities('BEIJING and Saint Petersburg Town', index))
# Output: {(12, 28): 'Saint Petersburg', (12, 33): 'Saint Petersburg Town', (18, 33): 'Petersburg Town'}
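One possible approach is to post-process the dictionary returned by find_cities. The sketch below is not part of pyahocorasick; resolve_overlaps is a hypothetical helper name, and it assumes the spans are (start, end) pairs with an exclusive end, as in the example above. It ranks matches longest first (earliest start first on ties) and greedily keeps each match that does not overlap one already kept. It may also be worth checking whether the automaton's iter_long method, which searches for longest matches, fits your case.

```python
def resolve_overlaps(matches):
    """Drop overlapping matches, preferring the longest, then the earliest.

    `matches` maps (start, end) spans (end exclusive) to the matched phrase,
    i.e. the same shape that find_cities returns. Hypothetical helper.
    """
    def rank(item):
        (start, end), _name = item
        # Longest span first; for equal lengths, the earliest start first.
        return (-(end - start), start)

    kept = {}
    for (start, end), name in sorted(matches.items(), key=rank):
        # Two half-open spans are disjoint when one ends before the other starts.
        if all(end <= s or start >= e for s, e in kept):
            kept[(start, end)] = name
    # Return the surviving matches in order of position in the text.
    return dict(sorted(kept.items()))


found = {
    (12, 28): 'Saint Petersburg',
    (12, 33): 'Saint Petersburg Town',
    (18, 33): 'Petersburg Town',
}
print(resolve_overlaps(found))
# {(12, 33): 'Saint Petersburg Town'}
```

The overlap check compares each candidate against every kept span, so it is quadratic in the number of matches. That is probably fine for short texts; for long inputs an interval-based structure would be a better fit.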