autocorrect
Fine tuning and improving
Hi, first of all, this looks great. Thanks a lot. I compared 3 different packages (yours, pyspellchecker, textblob) and yours performs the best.
How can I improve performance? Is there a way to finetune this to a specific data set?
Ah, great! ^^ I was planning to do this comparison myself, so I'm glad you already did it. Out of curiosity, could you post some results from that comparison here?
As for the performance: if you mean speed, I'm not sure, since it's already pretty optimized. If you mean correction accuracy, I was thinking about adding a language model, so that it would decide how to correct based on context. It would be a great improvement, but pretty heavy, and it would take lots of work.
As for finetuning, you can follow the instructions for adding new languages (https://github.com/fsondej/autocorrect#adding-new-languages), but instead of running count_words on Wikipedia, run it on a text file with your data. So if it's in English, you would do:
from autocorrect.word_count import count_words
count_words('your_data.txt', 'en')
and then tar the output file:
tar -zcvf autocorrect/data/en.tar.gz word_count.json
It will replace the default English dictionary with yours. For best accuracy, you should also experiment with different threshold values (Speller(threshold=x)) and see which value works best.
Note that it's not really finetuning but retraining on your data from scratch, so you need a lot of data.
Hi @filyp, thanks for making it possible to replace the dictionary with one built from my own text file. Yours is also the only package I found that corrects words within sentences; every other one works on 1 word at a time.
However, I am getting an error when running tar on the output. Attaching screenshot. Can you please help?
Also, I see that it works fine up to changes = 2. How can I increase this? E.g. spell('NissSSan') returns 'Nissan', but spell('NissSSSan') returns 'NissSSSan'.
Help will be really appreciated.
changes higher than 2 isn't supported, because that would be computationally expensive and the corrections would often be ambiguous.
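To give a feel for why the cost grows so quickly, here is a rough sketch that counts how many candidate strings sit within a given number of single-character edits of a word. It is a generic Norvig-style enumeration written for illustration, not the code autocorrect actually uses; the names edits1 and candidates_within are made up for this example, and the alphabet is assumed to be lowercase ASCII.

```python
import string

# Assumption: a plain lowercase ASCII alphabet; real dictionaries may use more.
ALPHABET = string.ascii_lowercase


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def candidates_within(word, max_changes):
    """All strings reachable from word in at most max_changes edits."""
    seen = {word}
    frontier = {word}
    for _ in range(max_changes):
        # Expand only the newest layer, and keep only strings not seen before.
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    return seen


# Print the candidate counts for 1 and 2 changes; 3 also works but is slow.
for n in (1, 2):
    print(n, len(candidates_within("nissan", n)))
```

Each extra allowed change multiplies the candidate set by roughly the size of edits1 for one word, so going from 2 to 3 makes the search far larger and pulls in many more plausible but wrong dictionary words.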
tar is a bash command and needs to be run in a bash shell, not in a Python interpreter (google around to see how to use bash). There is a trick for running shell commands from inside an IPython or Jupyter session, though, by prefixing them with !, so:
!tar -zcvf ...
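If no bash shell is at hand, the same archive can probably be built from plain Python with the standard tarfile module. This is only a sketch: pack_word_counts is a name made up here, and it assumes that autocorrect just needs a gzip-compressed tar containing word_count.json at the top level, which is what the tar command above produces.

```python
import os
import tarfile
from pathlib import Path


def pack_word_counts(json_path, archive_path):
    """Write json_path into a gzip-compressed tar archive at archive_path.

    Roughly equivalent to: tar -zcf archive_path json_path
    """
    # Create the target directory if it is missing (plain tar would not).
    Path(archive_path).parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores the file under its bare name, without directories.
        tar.add(json_path, arcname=os.path.basename(json_path))


# Example call, using the paths from the tar command in this thread:
# pack_word_counts("word_count.json", "autocorrect/data/en.tar.gz")
```

The example call is left commented out because it overwrites the default English dictionary, so run it only from the directory where word_count.json was generated.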