texthero
Add correct_mistakes(s)
Or at least count how many mistakes a sentence contains.
See: https://pypi.org/project/pyenchant/
@jbesomi I have checked the library.
We can create a function count_mistakes
which returns the number of mistakes per sentence.
For correcting mistakes, the library has a method suggest(word)
which returns a list of suggestions for the given word. We could add a method correct_mistakes
that by default picks the first suggestion and replaces the incorrect word with it. Do you have another suggestion for this?
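To make the proposal concrete, here is a minimal sketch of both functions. It assumes a checker object exposing check(word) and suggest(word), as a pyenchant dictionary does. ToyChecker, WORD_RE and the tokenization rule are hypothetical stand-ins used only so the example runs without any dictionary installed; they are not part of texthero or pyenchant.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical tokenization rule: runs of letters and apostrophes count as words.
WORD_RE = re.compile(r"[A-Za-z']+")


class ToyChecker:
    """Stand-in for a real dictionary object (e.g. one created with pyenchant).

    It only mimics the two methods the sketch relies on: check and suggest.
    """

    def __init__(self, words: Iterable[str], suggestions: Dict[str, List[str]]):
        self._words = {w.lower() for w in words}
        self._suggestions = suggestions

    def check(self, word: str) -> bool:
        # True when the word is known to the dictionary.
        return word.lower() in self._words

    def suggest(self, word: str) -> List[str]:
        # Ranked candidate corrections; empty list when nothing is known.
        return self._suggestions.get(word.lower(), [])


def count_mistakes(text: str, checker) -> int:
    """Count the words in `text` that the checker does not recognize."""
    return sum(1 for word in WORD_RE.findall(text) if not checker.check(word))


def correct_mistakes(text: str, checker) -> str:
    """Replace each unrecognized word with the checker's first suggestion."""

    def fix(match: "re.Match[str]") -> str:
        word = match.group(0)
        if checker.check(word):
            return word
        candidates = checker.suggest(word)
        # Leave the word untouched when there is no suggestion.
        return candidates[0] if candidates else word

    return WORD_RE.sub(fix, text)


checker = ToyChecker(
    words=["this", "is", "a", "sentence"],
    suggestions={"thiss": ["this"], "sentense": ["sentence"]},
)
n_mistakes = count_mistakes("thiss is a sentense", checker)
corrected = correct_mistakes("thiss is a sentense", checker)
```

On a pandas Series, each function could be applied row by row, for example with s.apply(count_mistakes, checker=checker). Always taking the first suggestion is the simplest policy; it can pick a wrong word when the suggestion list is ambiguous.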
Good idea. One correction: it should return the number of mistakes per pandas Series row.
OK, but what about correct_mistakes?
What you proposed is fine. One thing: before going with pyenchant, it would be great to select two or three similar packages, test and rank them, and only then implement count_mistakes and correct_mistakes.
Hi, I checked and these are the alternative options:
- symspellpy, which is a Python port of SymSpell
- spacy_hunspell
- pyspellchecker
These sources claim SymSpell should be the best in terms of performance (time):
With SymSpell we can implement automatic_correct_mistakes,
but it will be a bit more complicated than with PyEnchant.
Please check and let me know your opinion.
Great. Neither source cites or benchmarks pyenchant. We should probably test pyenchant and symspellpy ourselves, for both quality of results and execution time, and pick the best. In the end, we might decide to support both and let the user choose. In that case we would still need a benchmark to understand which one works best in which situation. What's your opinion, Selim?
Sorry for the late reply, we had holidays here in Egypt after Ramadan. Yes, I think we should test both so we can determine ourselves which is better and for which use case. However, how do you suggest testing the quality of the results on large data? I will start on them tomorrow and keep you updated.
No problem; thank you for your help! For the performance comparison, just pick a large NLP dataset and compare the execution time. For quality, I guess you need to look at the results yourself and decide.
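A rough sketch of how that comparison could be set up, using only the standard library. The names benchmark and sample_outputs are hypothetical helpers, and the two correctors in the example are trivial placeholders; in practice each would wrap one candidate library (for example pyenchant or symspellpy) behind a function that takes a string and returns the corrected string.

```python
import time
from typing import Callable, Dict, List

Corrector = Callable[[str], str]


def benchmark(
    correctors: Dict[str, Corrector], texts: List[str], repeats: int = 3
) -> Dict[str, float]:
    """Return the best wall-clock time in seconds per corrector over `repeats` runs."""
    timings: Dict[str, float] = {}
    for name, correct in correctors.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for text in texts:
                correct(text)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


def sample_outputs(
    correctors: Dict[str, Corrector], texts: List[str], n: int = 5
) -> List[Dict[str, str]]:
    """Collect side-by-side outputs for the first `n` texts, for manual review."""
    rows = []
    for text in texts[:n]:
        row = {"input": text}
        for name, correct in correctors.items():
            row[name] = correct(text)
        rows.append(row)
    return rows


# Placeholder correctors so the sketch runs without any spell-checking library.
correctors = {"lower": str.lower, "upper": str.upper}
texts = ["Thiss is a sentense", "Anothr one"]

timings = benchmark(correctors, texts)
rows = sample_outputs(correctors, texts, n=1)
```

Taking the minimum over several repeats reduces noise from other processes. The side-by-side rows are meant to be read by a person, since no labelled ground truth is assumed here.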