
can't search overlapped words?

Open xuexcy opened this issue 5 years ago • 5 comments

from flashtext import KeywordProcessor

kp = KeywordProcessor()
kp.add_keyword("ABC DE")
kp.add_keyword("DE FGHI")
kp.extract_keywords("ABC DE FGHI")

This returns ['ABC DE']. Why not ['ABC DE', 'DE FGHI']?

xuexcy avatar Oct 12 '18 09:10 xuexcy

Second this. Is this a limitation of the algorithm, or a simple bug? If it is the former, then it should at least be documented in the usage notes.

jdclarke5 avatar Oct 16 '18 06:10 jdclarke5

I was stuck at this too, and I tweaked the algorithm to match overlapping patterns. I will try to submit a pull request soon!

aneeshvartakavi avatar Dec 12 '18 07:12 aneeshvartakavi

I suspect that if we reverse the document and conduct keyword matching in the reversed order, we can get both.

from flashtext import KeywordProcessor

document = "ABC DE FGHI"
keywords = ["ABC DE", "DE FGHI"]

def extract_overlapping_keywords(document, keywords):
    res = []
    kp = KeywordProcessor()
    kp.add_keywords_from_list(keywords)
    forward_extractions = kp.extract_keywords(document)
    print("Forward extraction:", forward_extractions)
    res.extend(forward_extractions)
    
    reversed_keywords = [" ".join(keyword.split(" ")[::-1]) for keyword in keywords]
    reversed_kp = KeywordProcessor()
    reversed_kp.add_keywords_from_list(reversed_keywords)    
    reversed_document = " ".join(document.split(" ")[::-1])
    tmp = reversed_kp.extract_keywords(reversed_document)
    reversed_extraction = [" ".join(keyword.split(" ")[::-1]) for keyword in tmp]
    print("Backward extraction:", reversed_extraction)
    res.extend(reversed_extraction)
    
    return res

extract_overlapping_keywords(document, keywords)

mickeysjm avatar Apr 03 '19 19:04 mickeysjm

Plus one on this.

Vineeth-Mohan avatar Apr 14 '19 14:04 Vineeth-Mohan

Keyword matching in reverse order won't work once three or more keywords overlap in a chain. For example:

document = "ABC DEF GHI JKL"
keywords = ["ABC DEF", "DEF GHI", "GHI JKL"]

In both the forward and backward directions, we get only "ABC DEF" and "GHI JKL"; "DEF GHI" is never returned.

wangpeipei90 avatar Sep 05 '19 15:09 wangpeipei90
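
One way to sidestep the chained-overlap problem is to skip flashtext for this step and test every keyword at every word position, so that no match consumes the words another match needs. The following is only a minimal sketch under stated assumptions, not flashtext's algorithm or API: the function name extract_overlapping is hypothetical, tokens are split on whitespace, matching is case-sensitive, and the cost grows with (number of words) x (number of keywords), so it gives up the speed that flashtext is designed for. The second call uses "GHI JKL" as the third keyword so that it lines up with the document on word boundaries.

```python
def extract_overlapping(document, keywords):
    """Return every keyword found in `document` as a whole-word sequence.

    Matches are reported in order of their starting word, and matches
    that share words are all kept (nothing is consumed).
    """
    tokens = document.split()
    # Pre-split each keyword into its sequence of words.
    keyword_tokens = [(kw, kw.split()) for kw in keywords]
    matches = []
    for start in range(len(tokens)):
        for kw, words in keyword_tokens:
            # Compare the window of words starting here with the keyword.
            if words and tokens[start:start + len(words)] == words:
                matches.append(kw)
    return matches


print(extract_overlapping("ABC DE FGHI", ["ABC DE", "DE FGHI"]))
# ['ABC DE', 'DE FGHI']
print(extract_overlapping("ABC DEF GHI JKL", ["ABC DEF", "DEF GHI", "GHI JKL"]))
# ['ABC DEF', 'DEF GHI', 'GHI JKL']
```

If a keyword occurs more than once in the document, it appears once per occurrence in the result; wrap the return value in a set or de-duplicate it if only distinct keywords are wanted.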