
Spell checker

Open selimelawwa opened this issue 4 years ago • 8 comments

First implementation of the correct_mistakes and correct_spacing methods. Added unit tests for both. Implementation uses symspellpy.

closes #14

selimelawwa avatar Jun 04 '20 13:06 selimelawwa

Impressive work, thank you! Is there a situation where we want to correct the spacing but not the mistakes? correct_spacing might be redundant, isn't it?

Also, without knowing the library, I would say that "spacing mistakes" are still mistakes and therefore, by default, correct_mistakes should take care of spacing mistakes. Opinions?

[Edit]

Also, another interesting feature would be to know how many mistakes there were in the Pandas Series in the first place (at the row level, for instance). This might be helpful for quantifying how dirty a dataset is.
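A minimal sketch of what such a row-level count could look like, assuming a "mistake" is simply a token missing from a known-word list. The names `KNOWN_WORDS`, `count_unknown_words` and `mistakes_per_row` are hypothetical and not part of this PR or of symspellpy.

```python
from typing import Iterable, List, Set

# Hypothetical toy vocabulary; a real check would use a full dictionary.
KNOWN_WORDS: Set[str] = {"the", "quick", "brown", "fox"}


def count_unknown_words(text: str, vocab: Set[str] = KNOWN_WORDS) -> int:
    """Count whitespace-separated tokens that are not in the vocabulary."""
    return sum(1 for token in text.lower().split() if token not in vocab)


def mistakes_per_row(rows: Iterable[str]) -> List[int]:
    """Return one mistake count per input row."""
    return [count_unknown_words(row) for row in rows]


counts = mistakes_per_row(["the quick brown fox", "teh quikc brown fox", ""])
```

On a Pandas Series the same per-row function could presumably be applied with `s.apply(count_unknown_words)`, and the sum or mean of the result would give a rough "dirtiness" score for the dataset.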

jbesomi avatar Jun 04 '20 13:06 jbesomi

Impressive work, thank you! Is there a situation where we want to correct the spacing but not the mistakes? correct_spacing might be redundant, isn't it?

Yes, correct_spacing might be redundant if there is no actual use case for it, i.e. if all users need both spelling correction and spacing correction. But we should keep "_correct_spacing".

The symspellpy library has two methods for correcting a string:

  • lookup_compound: can insert only a single space into a token (a string fragment separated by existing spaces). It is intended for spelling correction of word-segmented text but can fix an occasional missing space. Because of the single-space restriction per token, there are fewer variants to generate and evaluate. It is therefore faster, and the quality of the correction is usually better.

  • word_segmentation: can insert as many spaces as required into a token. It is therefore also suitable for long strings without any spaces. The drawback is lower speed and lower correction quality, as many more potential variants exist that need to be generated, evaluated and chosen from.

And yes, fixing spaces SHOULD be part of correcting mistakes, and it is actually done in sym_spell.lookup_compound. However, for a sentence like "thequickbrownfoxjumpsoverthelazydog", lookup_compound doesn't fix it, as it needs more than one space in a single token.
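To illustrate the difference between "at most one space per token" and "as many spaces as needed", here is a toy sketch in plain Python. It is not symspellpy's algorithm (there is no spelling correction or frequency scoring here), and `VOCAB`, `split_once` and `split_many` are hypothetical names used only for this example.

```python
from typing import List, Optional, Set

# Hypothetical toy vocabulary standing in for a real frequency dictionary.
VOCAB: Set[str] = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}


def split_once(token: str, vocab: Set[str] = VOCAB) -> Optional[str]:
    """Try to repair a token by inserting at most one space."""
    if token in vocab:
        return token
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in vocab and right in vocab:
            return f"{left} {right}"
    return None  # not fixable with a single space


def split_many(token: str, vocab: Set[str] = VOCAB) -> Optional[str]:
    """Segment a token into any number of vocabulary words (dynamic programming)."""
    # best[i] holds one segmentation of token[:i], or None if none exists.
    best: List[Optional[List[str]]] = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            prefix = best[start]
            word = token[start:end]
            if prefix is not None and word in vocab:
                best[end] = prefix + [word]
                break
    result = best[len(token)]
    return " ".join(result) if result is not None else None


sentence = "thequickbrownfoxjumpsoverthelazydog"
one_space = split_once(sentence)    # a single space is not enough here
many_spaces = split_many(sentence)  # the multi-space search can segment it
```

The single-space search only has to try each split position once per token, while the multi-space search has to consider many more candidate segmentations, which matches the speed trade-off described above.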


So to handle this issue, the Texthero method "correct_mistakes" has a parameter fix_spacing, which is false by default; if set to true, it applies word_segmentation, aka "_correct_word_spacing", to the string. We keep this optional because not all text data will be similar to "thequickbrownfoxjumpsoverthelazydog" (requiring more than one added space per token), and since word_segmentation is slower, we only run it if the user states that they need it.
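A rough sketch of that opt-in design, under stated assumptions: the signature below is hypothetical (the PR's actual code is not shown in this thread), the two correctors are passed in as plain callables so the example runs without symspellpy, and running segmentation before spelling correction is an assumption as well.

```python
from typing import Callable, Iterable, List


def correct_mistakes(
    texts: Iterable[str],
    spell_fix: Callable[[str], str],
    segment: Callable[[str], str],
    fix_spacing: bool = False,
) -> List[str]:
    """Correct each text; run the slower segmentation pass only on request."""
    corrected = []
    for text in texts:
        if fix_spacing:
            # Assumed order: restore missing spaces first, then fix spelling.
            text = segment(text)
        corrected.append(spell_fix(text))
    return corrected


# Toy stand-ins for the real correctors, for illustration only.
def toy_spell_fix(text: str) -> str:
    return text.replace("teh", "the")


def toy_segment(text: str) -> str:
    return text.replace("thedog", "the dog")


fast = correct_mistakes(["teh cat", "thedog"], toy_spell_fix, toy_segment)
thorough = correct_mistakes(
    ["teh cat", "thedog"], toy_spell_fix, toy_segment, fix_spacing=True
)
```

Keeping the flag off by default means users with ordinary, already-segmented text never pay for the slower pass.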

FYI, edit_distance is the number of changes (add, delete) needed for a string to become a valid word. When looking for a correction for a word w, the spell correction algorithm suggests all known words within a given edit_distance, and the suggestions are ordered by their frequency in the language.
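As a small worked example of that idea, here is a sketch using plain Levenshtein distance, which also counts substitutions; the exact metric and ranking used inside the library may differ. `FREQUENCIES` and `suggest` are hypothetical, and the frequency values are made up for illustration.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: each insertion, deletion or substitution costs 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # delete ca
                current[j - 1] + 1,             # insert cb
                previous[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical word frequencies, for illustration only.
FREQUENCIES: Dict[str, int] = {
    "the": 1000, "then": 300, "ten": 120, "hen": 20, "quick": 50,
}


def suggest(word: str, max_edit_distance: int = 1) -> List[str]:
    """Known words within the distance limit, most frequent first."""
    candidates = [
        known for known in FREQUENCIES
        if edit_distance(word, known) <= max_edit_distance
    ]
    return sorted(candidates, key=lambda known: -FREQUENCIES[known])


suggestions = suggest("thn")
```

With this toy data, "thn" is one edit away from "the", "then" and "ten", so those three come back ordered by frequency, while "hen" (two edits away) is left out.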

selimelawwa avatar Jun 04 '20 14:06 selimelawwa

@mk2510 Can you please check what is failing in the build? I have resolved the merge conflicts and addressed the comments you requested. Unit tests are all succeeding.

Do you have any idea why the build is failing?

Also, it would be nice to have a short chat today between me, you and @jbesomi. Could we have a short Zoom call?

selimelawwa avatar Aug 11 '20 12:08 selimelawwa

@selimelawwa I am sorry, I saw your push so late. Would tomorrow evening CET suit you as well? It looks like you forgot to format your files with the format.sh script, which you can find in the script folder :octocat:

mk2510 avatar Aug 11 '20 19:08 mk2510

Hey guys,

thank you Max for your valuable feedback!

Regarding the PR overall, I'm not sure whether we really need this feature right now ... it reminds me of #129 ... the overall objective of Texthero has changed a bit since the respective issue (#14, 08.05.2020) was opened...

Max, what's your opinion on the introduction of this new feature? Shouldn't we all work together, focusing on the API checklist (#85)?

An alternative might be to add a texthero.beta module where we can introduce these kinds of new features and ask users to test them. Opinions?

jbesomi avatar Aug 12 '20 12:08 jbesomi

@mk2510 @jbesomi OK, we can have a call at around 6-7 pm today. What do you think?

selimelawwa avatar Aug 12 '20 14:08 selimelawwa

I can't today at 18. What about tomorrow at 18:00?

jbesomi avatar Aug 12 '20 14:08 jbesomi

Fine tomorrow at 18h


selimelawwa avatar Aug 12 '20 16:08 selimelawwa