
Add correct_mistakes(s)

Open jbesomi opened this issue 4 years ago • 8 comments

Or at least check how many mistakes there are in a sentence.

See: https://pypi.org/project/pyenchant/

jbesomi avatar May 08 '20 16:05 jbesomi

@jbesomi I have checked the library. We can create a function count_mistakes that returns the number of mistakes per sentence.

For correcting mistakes, the library has a method suggest(word) that returns a list of suggestions for the given word. We could have a method correct_mistakes that by default chooses the first word in the suggestions and replaces the incorrect word with it. Do you have another suggestion for this?
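To make the proposal concrete, here is a minimal sketch of the count_mistakes / correct_mistakes pattern. Note the assumptions: the toy KNOWN_WORDS set and the difflib-based suggest() are stand-ins for pyenchant's Dict.check(word) and Dict.suggest(word); the real implementation would delegate to those.

```python
import difflib

# Toy stand-in for a pyenchant dictionary (assumption for illustration).
# In the real function, pyenchant's Dict.check(word) would test membership
# and Dict.suggest(word) would produce the candidate list.
KNOWN_WORDS = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def suggest(word):
    """Return candidate corrections for a misspelled word, best first."""
    return difflib.get_close_matches(word.lower(), KNOWN_WORDS, n=3)

def count_mistakes(sentence):
    """Number of words in the sentence not found in the dictionary."""
    return sum(1 for w in sentence.split() if w.lower() not in KNOWN_WORDS)

def correct_mistakes(sentence):
    """Replace each unknown word with the first suggestion, if any."""
    out = []
    for w in sentence.split():
        if w.lower() in KNOWN_WORDS:
            out.append(w)
        else:
            candidates = suggest(w)
            out.append(candidates[0] if candidates else w)
    return " ".join(out)
```

Applied to a pandas Series, both functions would simply be mapped over the rows, e.g. s.apply(count_mistakes).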

selimelawwa avatar May 22 '20 21:05 selimelawwa

Good idea. *Returns the number of mistakes per pandas Series row.

jbesomi avatar May 22 '20 21:05 jbesomi

OK, but what about correcting mistakes?

selimelawwa avatar May 22 '20 22:05 selimelawwa

What you proposed is fine. One thing, though: before going with pyenchant, it would be great to select two or three similar packages, test and rank them, and finally implement count_mistakes and correct_mistakes.

jbesomi avatar May 22 '20 22:05 jbesomi

Hi, I checked and these are the alternative options:

These sources claim SymSpell should be the best in terms of performance (time):

With SymSpell we can implement automatic_correct_mistakes, but it will be a bit more complicated than with PyEnchant.
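For context on why SymSpell is fast: instead of generating all edits of the input word, it precomputes delete-only variants of every dictionary word and matches them against deletes of the input. The sketch below illustrates that core idea with the standard library only; function names (build_index, lookup) are illustrative, not the symspellpy API.

```python
from collections import defaultdict

def deletes(word, max_distance=1):
    """All strings reachable from `word` by deleting up to max_distance chars."""
    results = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results

def build_index(dictionary, max_distance=1):
    """Precompute delete variants for every dictionary word (SymSpell's trick)."""
    index = defaultdict(set)
    for word in dictionary:
        for d in deletes(word, max_distance):
            index[d].add(word)
    return index

def lookup(word, index, max_distance=1):
    """Candidate corrections: dictionary words sharing a delete variant."""
    candidates = set()
    for d in deletes(word, max_distance):
        candidates |= index.get(d, set())
    return sorted(candidates)
```

The extra complexity mentioned above comes from managing this precomputed index (and, in symspellpy, loading a frequency dictionary), whereas PyEnchant just wraps a ready-made spell checker.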

Please check and let me know your opinion.

selimelawwa avatar May 24 '20 18:05 selimelawwa

Great. Neither source cites or benchmarks pyenchant, though. We should probably test both pyenchant and symspellpy ourselves, for quality of results as well as execution time, and pick the best. In the end, we might decide to include both and let the user choose; in that case we would still need a benchmark to understand which one works best in which situation. What's your opinion, Selim?

jbesomi avatar May 25 '20 12:05 jbesomi

Sorry for the late reply; we had holidays here in Egypt after Ramadan. Yes, I think we should test both so we can determine ourselves which is better and for which use case. However, how do you suggest testing the quality of results on large data? I will start on it tomorrow and keep you updated.

selimelawwa avatar Jun 01 '20 22:06 selimelawwa

No problem; thank you for your help! For the performance comparison, just pick a large NLP dataset and compare the execution time. For quality, I guess you need to look at the results yourself and decide.
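The execution-time comparison could be as simple as a best-of-N timing harness over the chosen dataset. A minimal sketch, where `pyenchant_check` and `symspell_check` are hypothetical placeholders for the two candidate implementations:

```python
import time

def benchmark(checker, sentences, repeats=3):
    """Best wall-clock time (seconds) for running `checker` over all
    sentences; taking the best of several runs reduces timer noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for s in sentences:
            checker(s)
        best = min(best, time.perf_counter() - start)
    return best

# Usage sketch (names are placeholders, not real functions):
# corpus = load_large_nlp_dataset()
# print(benchmark(pyenchant_check, corpus))
# print(benchmark(symspell_check, corpus))
```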

jbesomi avatar Jun 02 '20 05:06 jbesomi