python-wordsegment Using with Additional corpus of spelling mistakes.

Open willwade opened this issue 1 year ago • 3 comments

I'm considering using this as a service behind an app for disabled people we support, who would use it to communicate. Many of these users tap out letters but rarely type a space. The snag is that they also make spelling errors. (See https://youtu.be/SDkE-aO3tOQ?si=0GAUyTKDh-q_sAxm and a quick iOS app we made, https://github.com/AceCentre/DragToSpeak. We are now contemplating a REST API built largely on wordsegment.)

So I was wondering about adding to the standard corpus with something like https://www.dcs.bbk.ac.uk/~ROGER/corpora.html

I read this https://stackoverflow.com/a/32364566/1123094

It looks like I can create a file of bigrams or unigrams with weights and add it to the standard corpus. Is that right, or is there a better way?
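For illustration, here is a minimal sketch of that idea using plain dictionaries. It assumes the extra entries live in a tab-separated "word<TAB>count" text file; the function names `read_counts` and `merge_counts` and all counts are made up for the example and are not part of wordsegment.

```python
import io


def read_counts(lines):
    """Parse 'word<TAB>count' lines into a dict of float counts (assumed format)."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        word, count = line.split("\t")
        counts[word.lower()] = float(count)
    return counts


def merge_counts(base, extra):
    """Return a copy of base with extra entries added; existing words keep their count."""
    merged = dict(base)
    for word, count in extra.items():
        merged.setdefault(word, count)
    return merged


# Illustrative data only: the counts are invented for this example.
base = {"alcohol": 1000.0}
extra_file = io.StringIO("alchol\t1000\nalcohol\t5\n")
merged = merge_counts(base, read_counts(extra_file))
print(merged)
```

With wordsegment itself, the same merge could presumably be applied to `wordsegment.UNIGRAMS` after calling `wordsegment.load()`, since that object behaves like an ordinary dict of counts; treat that as an assumption to check against the installed version.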

Jan 18 '24 00:01 willwade

Modifying the unigrams and bigrams is the best way I can think of. You’ll have to account for every typo variation of every word though. There may be a way to modify the algorithm instead but I’m not sure.
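To give a feel for what "every typo variation" means in practice, here is a rough sketch that enumerates all strings one edit away from a word (one deletion, transposition, replacement, or insertion). This is a generic spelling-correction technique, not something wordsegment provides, and the name `edits1` is just a label for the example.

```python
import string


def edits1(word, alphabet=string.ascii_lowercase):
    """Return every string exactly one simple edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    variants = set(deletes + transposes + replaces + inserts)
    variants.discard(word)  # the word itself is not a typo
    return variants


variants = edits1("alcohol")
print(len(variants), "alchol" in variants)
```

Even at a single edit, a seven-letter word produces a few hundred candidates, which is why a curated list of observed misspellings is usually more manageable than generating them all.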

Certainly AI models can do it but I don’t know about scope and scale.

Jan 18 '24 00:01 grantjenks

Yeah. GPT can do it - and multilingually. But it feels like a huge hammer to crack a nut. Thanks

Jan 18 '24 00:01 willwade

If anyone is interested, I've got a complete modified unigrams JSON in this repo, along with code to read in the spelling mistakes, here:

https://github.com/AceCentre/Correct-A-Sentence/blob/main/helper-scripts/create_unigrams_spellingerrors.py

I dare say there is some madness in my logic: I am giving each misspelling the weight of the correctly spelled word, which may be a bad idea.
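One possible alternative, sketched below, is to give each misspelling a discounted share of the correct word's weight and to skip any "misspelling" that is already a real word in the unigrams. The `discount` value and the function name `misspelling_weights` are hypothetical choices for illustration, and the counts are invented; this is not how the linked script works.

```python
def misspelling_weights(unigrams, mapping, discount=0.5):
    """Weight each misspelling at a fraction of its correct word's count.

    mapping maps misspelling -> correct word. Entries are skipped when the
    misspelling is already a known word or the correct word has no count.
    """
    weights = {}
    for wrong, right in mapping.items():
        if wrong in unigrams or right not in unigrams:
            continue
        weights[wrong] = unigrams[right] * discount
    return weights


# Illustrative counts only.
unigrams = {"alcohol": 1000.0, "form": 800.0, "from": 5000.0}
mapping = {"alchol": "alcohol", "form": "from", "teh": "the"}
weights = misspelling_weights(unigrams, mapping, discount=0.5)
print(weights)
```

Here "form" is left alone because it is a real word with its own count, and "teh" is skipped because "the" is missing from this toy dictionary. Whether a discount actually improves segmentation would need testing on real input.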

NB: Can someone clarify something for me? I've updated the unigrams JSON. Should I also be updating the bigrams JSON with the misspellings, e.g. for "</s/> alcohol": 541645.0 also add "</s/> alchol": 541645.0, and so on?
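The thread does not settle whether this is needed. As far as I understand wordsegment's scoring, a word pair with no bigram entry falls back to the unigram score, so misspellings may still segment without it. If one did want to mirror the bigrams, a sketch could look like this; `mirror_bigrams` is a made-up name, the "</s/> alcohol" key and count are copied from the post above, and the "alcohol abuse" count is invented.

```python
def mirror_bigrams(bigrams, mapping, discount=1.0):
    """Copy each bigram containing a correct word to its misspelled variants.

    mapping maps misspelling -> correct word. Only one token per bigram is
    swapped at a time; existing bigram entries are never overwritten.
    """
    by_correct = {}
    for wrong, right in mapping.items():
        by_correct.setdefault(right, []).append(wrong)

    merged = dict(bigrams)
    for key, count in bigrams.items():
        parts = key.split()
        if len(parts) != 2:
            continue  # ignore malformed keys
        first, second = parts
        for wrong in by_correct.get(second, []):
            merged.setdefault(f"{first} {wrong}", count * discount)
        for wrong in by_correct.get(first, []):
            merged.setdefault(f"{wrong} {second}", count * discount)
    return merged


bigrams = {"</s/> alcohol": 541645.0, "alcohol abuse": 200.0}
merged_bigrams = mirror_bigrams(bigrams, {"alchol": "alcohol"})
print(merged_bigrams)
```

With the default `discount=1.0` this reproduces the same-weight entry proposed in the post; a smaller value would rank the misspelled pair below the correct one.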

Jan 18 '24 11:01 willwade
