python-wordsegment
Using with an additional corpus of spelling mistakes
I’m considering using this as a service behind an app for the disabled people we support, who would use it to communicate. We see a lot of users who type by tapping on letters but often never enter a space. The snag is that they also make spelling errors. (See https://youtu.be/SDkE-aO3tOQ?si=0GAUyTKDh-q_sAxm and a quick iOS app we made, https://github.com/AceCentre/DragToSpeak. We are now contemplating a REST API built largely on wordsegment.)
So I was wondering about extending the standard corpus with something like https://www.dcs.bbk.ac.uk/~ROGER/corpora.html
I read this https://stackoverflow.com/a/32364566/1123094
It looks like I can create a file of bigrams or unigrams with weights and add it to the standard corpus. Right? Or is there a better way?
Modifying the unigrams and bigrams is the best way I can think of. You’ll have to account for every typo variation of every word though. There may be a way to modify the algorithm instead but I’m not sure.
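To illustrate why the typo has to be in the unigrams at all, here is a minimal sketch of a unigram-only segmenter. It is not wordsegment's actual implementation (the library also uses bigrams and its own data files); `make_segmenter` and the toy counts are hypothetical names and numbers made up for the example.

```python
import math
from functools import lru_cache


def make_segmenter(unigrams):
    """Build a segment(text) function from a dict of word -> count.

    Hypothetical helper for illustration only: unigram scores,
    no bigrams, no text cleaning.
    """
    total = float(sum(unigrams.values()))

    def log_score(word):
        if word in unigrams:
            return math.log10(unigrams[word] / total)
        # Unknown words get a penalty that grows with their length.
        return math.log10(10.0 / (total * 10 ** len(word)))

    @lru_cache(maxsize=None)
    def best(text):
        # Return (score, words) for the best split of text.
        if not text:
            return 0.0, ()
        candidates = []
        for i in range(1, len(text) + 1):
            tail_score, tail_words = best(text[i:])
            candidates.append(
                (log_score(text[:i]) + tail_score, (text[:i],) + tail_words)
            )
        return max(candidates)

    def segment(text):
        return list(best(text)[1])

    return segment


# Toy counts, not taken from any real corpus.
toy_counts = {"i": 500, "need": 300, "a": 400, "alcohol": 100}

without_typo = make_segmenter(toy_counts)("ineedalchol")
with_typo = make_segmenter({**toy_counts, "alchol": 10})("ineedalchol")
```

With the typo missing, the toy segmenter prefers to peel off the known word "a" and leave an unknown remainder; once "alchol" has any count, it wins as a single token. Mapping that token back to "alcohol" would still be a separate step.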
Certainly AI models can do it but I don’t know about scope and scale.
Yeah. GPT can do it, and multilingually. But it feels like using a sledgehammer to crack a nut. Thanks.
If anyone is interested, I've got a complete modified unigrams JSON in this repo, along with the code that reads in the spelling mistakes:
https://github.com/AceCentre/Correct-A-Sentence/blob/main/helper-scripts/create_unigrams_spellingerrors.py
I dare say there is some madness in my logic. I am giving each misspelling the same weight as the correctly spelled word, which may be a bad idea.
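One possible alternative to copying the weight unchanged is to give each misspelling a discounted share of the correct word's count and to never overwrite an entry that is already a real word. This is only a sketch: `add_misspellings`, the `discount` value, and the toy counts are assumptions for illustration, not what the linked script does.

```python
def add_misspellings(unigrams, misspellings, discount=0.01):
    """Return a copy of unigrams with typo variants added at reduced weight.

    unigrams:     dict of word -> count
    misspellings: dict of correct word -> iterable of typo variants
    discount:     fraction of the correct word's count given to each typo
                  (0.01 is an arbitrary starting point, not a tuned value)
    """
    merged = dict(unigrams)
    for correct, variants in misspellings.items():
        base = unigrams.get(correct)
        if base is None:
            continue  # no count to derive a weight from
        for typo in variants:
            if typo in unigrams:
                continue  # keep real words intact, e.g. "form" vs "from"
            # If two correct words share a typo, keep the larger weight.
            merged[typo] = max(merged.get(typo, 0.0), base * discount)
    return merged


# Toy data, not taken from any real corpus.
toy_unigrams = {"alcohol": 1000.0, "from": 5000.0, "form": 800.0}
toy_misspellings = {
    "alcohol": ["alchol", "alcohal"],
    "from": ["form", "frm"],
    "absent": ["abscent"],
}
augmented = add_misspellings(toy_unigrams, toy_misspellings)
```

Keeping typos well below the real word means a correctly typed string still segments as before, and the discount can be tuned against real user input.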
NB: Can someone clarify something for me? I've updated the unigrams JSON. Should I also be updating the bigrams JSON with the misspellings? For example, where it has "</s/> alcohol": 541645.0, should I add "</s/> alchol": 541645.0, and so on?
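Not an authoritative answer, but if the bigrams were to be extended as well, a sketch could look like the one below. It assumes the bigrams JSON maps "first second" keys to counts, as in the example above; `add_bigram_variants`, the `discount`, and the toy counts are hypothetical. As far as I can tell from the library's scoring code, a bigram that is missing falls back to the unigram score, so this step may be optional; treat that as an assumption to verify.

```python
def add_bigram_variants(bigrams, misspellings, discount=0.01):
    """Return a copy of bigrams with typo variants of either word added.

    bigrams:      dict of "first second" -> count
    misspellings: dict of correct word -> iterable of typo variants
    Only one word is replaced at a time; existing keys are never overwritten.
    """
    merged = dict(bigrams)
    for key, count in bigrams.items():
        parts = key.split()
        if len(parts) != 2:
            continue  # skip malformed keys rather than guess
        first, second = parts
        for typo in misspellings.get(first, ()):
            merged.setdefault(f"{typo} {second}", count * discount)
        for typo in misspellings.get(second, ()):
            merged.setdefault(f"{first} {typo}", count * discount)
    return merged


# Toy counts; the "</s/>" marker is copied from the example in this thread.
toy_bigrams = {"</s/> alcohol": 1000.0, "alcohol is": 500.0}
toy_typos = {"alcohol": ["alchol"]}
augmented_bigrams = add_bigram_variants(toy_bigrams, toy_typos)
```

If the typo bigrams are added, using the same discount as for the unigrams keeps the two tables roughly consistent with each other.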