jellyfish icon indicating copy to clipboard operation
jellyfish copied to clipboard

Adding QWERTY support to DL distance

Open DocShahrukh opened this issue 6 years ago • 4 comments

Adjusting cost in DL distance for QWERTY keypad mistakes, may be to others too. Please see if you're free.

key_pairs = [{'q','a'},{'q','w'},{'w','a'},{'w','e'},{'w','s'},{'e','s'},{'e','d'},{'e','r'},{'r','d'},{'r','f'},{'r','t'},{'t','g'},{'t','y'},{'y','g'},{'y','h'},{'y','u'},{'u','h'},{'u','j'},{'u','i'},{'i','j'},{'i','k'},{'i','o'},{'o','k'},{'o','l'},{'o','p'},{'l','k'},{'m','k'},{'m','n'},{'n','j'},{'n','b'},{'b','h'},{'b','v'},{'v','g'},{'v','c'},{'c','f'},{'c','x'},{'x','d'},{'x','z'},{'z','s'}]

def damerau_levenshtein_cost(a,b): if a==b : return 0 elif set([a,b]) in key_pairs: return .25 return 1

cost = damerau_levenshtein_cost(s1[i-1],s2[j-1])

I wish this hack finds some stack

DocShahrukh avatar Dec 21 '17 10:12 DocShahrukh

This is really interesting. @DocShahrukh would this approach work for OCR errors too, assuming one came up with a useful weighting? So '1' paired with 'l', etc.

jsfenfen avatar Dec 21 '17 17:12 jsfenfen

I think it’d make sense to make the cost function configurable, that’d let people do this for different layouts or ocr specific functions. I’d be glad to incorporate such a PR if anyone has time

On Thu, Dec 21, 2017 at 12:54 PM Jacob Fenton [email protected] wrote:

This is really interesting. @DocShahrukh https://github.com/docshahrukh would this approach work for OCR errors too, assuming one came up with a useful weighting? So '1' paired with 'l', etc.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jamesturk/jellyfish/issues/92#issuecomment-353415328, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAfYjwm1znugDdT4oG1TdkkS1zuz5Onks5tCptcgaJpZM4RJpuI .

jamesturk avatar Dec 21 '17 22:12 jamesturk

Don't forget QWERTZ (Germany, Austria and Eastern Europe)!

DonaldTsang avatar Apr 16 '18 01:04 DonaldTsang

And AZERTY, Dvorak, and other layouts. Cost function really needs to be configurable, perhaps with a few standard costs.

This is really interesting. @DocShahrukh would this approach work for OCR errors too, assuming one came up with a useful weighting? So '1' paired with 'l', etc.

Indeed, see for example:

DimitriPapadopoulos avatar Dec 09 '21 10:12 DimitriPapadopoulos