
Correction doesn't prioritize bigram.

Open xcTorres opened this issue 3 years ago • 3 comments

The unigram dictionary contains sengkerang and selatan, each with a frequency of 500. The bigram dictionary contains tangkerang selatan with a frequency of 1200.

When correcting the address "Jalan Surabaya No.17, Tengkerang Selatan, Bukit Raya", the output is 'jalan surabaya no 17 sengkerang selatan bukit raya', but I expect Tengkerang Selatan to be corrected to the bigram Tangkerang Selatan. Is it possible to do this?

xcTorres avatar Dec 05 '21 15:12 xcTorres

Is this similar to https://github.com/mammothb/symspellpy/issues/92?

I believe this is because bigrams are only used when a term from the input phrase is split up, e.g., when tengkerang is split into something like tengke rang. So it doesn't actually look up a correction for the corresponding bigram in the input phrase: it doesn't find a correction for tengkerang selatan as a whole, only for its individual words, possibly followed by a split. Relevant lines of code: (1) split an individual input term, (2) join the suggestions for the splits into a bigram, (3) compare and see whether the bigram exists.
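To make the behaviour described above concrete, here is a toy model in plain Python. It is not symspellpy's actual code: `edit_distance`, `correct_word`, and `correct_phrase` are hypothetical helpers written for illustration, and the splitting step is left out entirely. It only shows that when each input word is corrected on its own, a bigram entry for the adjacent pair never gets a chance to influence the result.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def correct_word(word, unigrams, max_distance=2):
    """Pick the closest unigram; break ties by higher frequency."""
    candidates = [
        (edit_distance(word, term), -count, term)
        for term, count in unigrams.items()
    ]
    candidates = [c for c in candidates if c[0] <= max_distance]
    return min(candidates)[2] if candidates else word


def correct_phrase(phrase, unigrams, bigrams):
    # Toy assumption: `bigrams` is accepted but never consulted for
    # adjacent input words, mirroring the behaviour described above.
    return " ".join(correct_word(w, unigrams) for w in phrase.split())


# Frequencies taken from the issue description.
unigrams = {"sengkerang": 500, "selatan": 500}
bigrams = {"tangkerang selatan": 1200}

result = correct_phrase("tengkerang selatan", unigrams, bigrams)
print(result)  # sengkerang selatan
```

In this sketch the bigram's frequency of 1200 is irrelevant, because the only unigram within range of tengkerang is sengkerang.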

The quickest workaround for this particular example would be to have tangkerang in the dictionary with a higher frequency than sengkerang. Perhaps you can use the frequency of the bigram to decide the frequency of tangkerang.
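One way to apply this workaround is to derive the unigram counts from the bigram counts before loading the dictionary. The sketch below is only an assumption about how that could be done: `boost_unigrams_from_bigrams` is a hypothetical helper, not part of symspellpy, and the rule "each word gets at least its bigram's count" is just one possible choice.

```python
def boost_unigrams_from_bigrams(unigrams, bigrams):
    """Return a copy of `unigrams` in which every word that appears in a
    bigram has a count at least as large as that bigram's count."""
    boosted = dict(unigrams)
    for phrase, count in bigrams.items():
        for word in phrase.split():
            boosted[word] = max(boosted.get(word, 0), count)
    return boosted


# Frequencies taken from the issue description.
unigrams = {"sengkerang": 500, "selatan": 500}
bigrams = {"tangkerang selatan": 1200}

boosted = boost_unigrams_from_bigrams(unigrams, bigrams)
print(boosted)
# {'sengkerang': 500, 'selatan': 1200, 'tangkerang': 1200}

# The boosted counts could then be loaded into symspellpy, e.g.:
#   for term, count in boosted.items():
#       sym_spell.create_dictionary_entry(term, count)
```

With tangkerang at 1200 and sengkerang at 500, both one edit away from tengkerang, the higher-frequency term should be preferred. Note that this rule also raises the count of selatan, which may or may not be desirable for other inputs.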

mammothb avatar Dec 06 '21 01:12 mammothb

Thanks for your explanation. Will have a try.

xcTorres avatar Dec 06 '21 13:12 xcTorres

@mammothb Do you plan to add this feature to symspellpy? Or do we continue with this "custom frequency logic"?

Now that we understand where it comes from, I might open a PR to fix that. For my use case it's almost vital, as I cannot afford false negatives, and bigrams help to solve that.

Thanks again for this nice lib. Have a great day :)

ierezell avatar Aug 01 '22 18:08 ierezell