[MODULE] - OCR autocorrection

Open jhoetter opened this issue 2 years ago • 4 comments

Please describe the module you would like to add to bricks: Turn "t3st docunnent" into "test document".

Do you already have an implementation? No, but statistical distributions could be relevant for this. Another option is to look up whether the word occurs in a vocabulary (e.g. via nltk) and, if not, to search for the vocabulary word with the smallest Levenshtein distance.
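A minimal sketch of the vocabulary-plus-Levenshtein idea, using only the standard library. The names `levenshtein`, `autocorrect`, and `VOCABULARY` are hypothetical, and the tiny hard-coded word set is an assumption standing in for a real vocabulary (e.g. one loaded via nltk). This is not an existing bricks module.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def autocorrect(text: str, vocabulary: Iterable[str]) -> str:
    """Replace each out-of-vocabulary word with its closest vocabulary word."""
    # Sorting makes the choice deterministic when two words are equally close.
    words = sorted(set(vocabulary))
    corrected = []
    for token in text.split():
        lowered = token.lower()
        if lowered in words:
            corrected.append(token)
        else:
            corrected.append(min(words, key=lambda w: levenshtein(lowered, w)))
    return " ".join(corrected)


# Hypothetical toy vocabulary; a real module would load a full word list.
VOCABULARY = {"test", "document", "text", "argument"}

print(autocorrect("t3st docunnent", VOCABULARY))  # prints "test document"
```

Scanning the whole vocabulary for every unknown word is slow for large word lists, and this sketch ignores punctuation and always picks a replacement even when the closest word is far away. A distance threshold would probably be needed in practice.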

Additional context: Tried this many years ago already, and there are simple but effective approaches for this.

jhoetter avatar Apr 05 '23 17:04 jhoetter

Interesting idea. Do you already have some resources on how to tackle this?

LeonardPuettmannKern avatar Apr 05 '23 19:04 LeonardPuettmannKern

Only what I've written above :D

jhoetter avatar Apr 05 '23 19:04 jhoetter

pyspellchecker implements this idea.

Just skimmed these; they seem to use only algorithms, no models.

EDIT: Uses Levenshtein distance in fact.

rasdani avatar Apr 06 '23 09:04 rasdani

Nice, that is great. I don't know yet whether pyspellchecker supports other languages such as German, but we can use it for English at least, I guess. For German, Swedish, French, ... I think an approach using the language's vocabulary would be helpful.
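A rough sketch of what a per-language vocabulary approach could look like. It uses the standard library's `difflib.get_close_matches`, which ranks candidates by a similarity ratio rather than by Levenshtein distance, so it is a stand-in for the idea and not the same algorithm. `VOCABULARIES` and `correct_word` are hypothetical names, and the word lists are toy assumptions; real ones would come from a dictionary file per language.

```python
import difflib

# Hypothetical toy vocabularies keyed by language code.
VOCABULARIES = {
    "en": ["test", "document"],
    "de": ["dokument", "prüfung", "rechnung"],
}


def correct_word(word: str, language: str, cutoff: float = 0.6) -> str:
    """Return the closest vocabulary word, or the input if nothing is close."""
    vocabulary = VOCABULARIES[language]
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    # n=1 keeps only the best match; cutoff rejects candidates that are too dissimilar.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("dokunnent", "de"))
print(correct_word("docunnent", "en"))
```

The `cutoff` parameter means a word with no sufficiently similar vocabulary entry is left unchanged, which avoids "correcting" names or rare words into something unrelated.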

jhoetter avatar Apr 06 '23 11:04 jhoetter