GetCandidates shows candidates but FixFragment doesn't correct the sentence
Hi everybody,
I trained a model for Spanish spellchecking and I'm using it to correct some OCR output (I'm building a full pipeline to digitize some very old typewritten documents and want to improve the resulting text). The problem I have now is that the model shows me candidates when I use GetCandidates, but it doesn't change the word when I use FixFragment. I wonder if it has something to do with context (n-grams and so on) or perhaps with the symbols in the sentence.
Here is an example:
text = 'posee respectívámente en las localí&ades de Carlos M. Naón'
corrector.GetCandidates(['localí&ades'],0) -> ('localidades', 'localí&ades', 'localicades', 'localizades')
corrector.FixFragment(text) -> 'posee respectívamente en las local&ades de Carlos M. Naón'
It corrects "respectívámente", but in "localí&ades" it only removes the 'í'. Maybe it's about the tokens it uses to decide where a word starts and ends.
Another example without special characters:
text='con anterioridad a la sanción de la ley'
corrector.FixFragment(text) -> 'con anterioridad la la canción de la ley'
It replaces "sanción" with "canción", but if I call GetCandidates, "sanción" is the first candidate:
corrector.GetCandidates(['sanción'],0) -> ('sanción', 'canción', 'sanación', 'sanchón', 'sención', 'anción', 'sancion', 'sanión', 'kanción', 'sunción', 'sandión', 'sancián', 'sansión', 'sanció')
I think this issue is similar to https://github.com/bakwc/JamSpell/issues/85
Thanks for the help!
Currently it doesn't expect special symbols inside words. You can try replacing all symbols inside words with some character; in that case it should start correcting them. I will think about how to handle this case better.
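For anyone who wants to try this suggestion, here is a minimal preprocessing sketch. The helper name mask_inword_symbols and the placeholder 'x' are my own choices for illustration and are not part of JamSpell; the only assumption is that a symbol sitting between two letters should become an ordinary letter so the word stays in one piece.

```python
import re
import unicodedata

# Hypothetical helper, not part of JamSpell.
# Matches a run of symbols (not letters, digits, underscore, or whitespace)
# that has a letter immediately before it and immediately after it.
_INWORD_SYMBOLS = re.compile(r"(?<=[^\W\d_])[^\w\s]+(?=[^\W\d_])")


def mask_inword_symbols(text, placeholder="x"):
    """Replace symbol runs found inside words with a placeholder letter."""
    # Compose accents first so that 'i' + combining acute becomes a single
    # letter and the combining mark is not mistaken for a symbol.
    text = unicodedata.normalize("NFC", text)
    return _INWORD_SYMBOLS.sub(placeholder, text)
```

The masked string can then be passed to corrector.FixFragment in place of the raw text. I have not checked which candidate it picks afterwards. Note that hyphens and apostrophes between letters get replaced too, so the pattern may need narrowing for texts that contain them.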
Thanks, I'll try that. Another workaround I'm considering, though I don't know if it will work, is to retrain the model after adding the symbols to the alphabet txt file, so it recognizes them when spellchecking.
Yes, it should work even better. You can try on a small corpus first and let me know if it helps.
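In case it helps with the small-corpus test, this is a rough sketch for extending the alphabet before retraining. It assumes the alphabet file is plain text that simply lists the allowed characters; extend_alphabet is a hypothetical helper, not a JamSpell function.

```python
def extend_alphabet(alphabet, extra_symbols):
    """Return the alphabet text with any missing extra symbols appended."""
    # Assumption: the alphabet file is a plain list of characters.
    body = alphabet.rstrip("\n")
    # dict.fromkeys keeps the first occurrence of each symbol, in order.
    missing = [
        ch for ch in dict.fromkeys(extra_symbols)
        if ch not in body and not ch.isspace()
    ]
    return body + "".join(missing) + "\n"
```

The result would be written to a new alphabet file and the model retrained with it as described in the JamSpell README. Whether the retrained model then handles words like "localí&ades" is exactly what the small-corpus test should show.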