tesseract
Inaccurate OCR results for lines with many dots
Environment
- Tesseract Version: 4.1.1
- Platform: MacOS Catalina 10.15.4
Current Behavior:
A few issues with tesseract -l eng --psm 1 on this image:
Some dots on lines ignored:
Cheese on Pasta...
Some dots on lines are replaced by strange letters, and the text is incorrectly capitalised:
SAUCE ON PASTA... cccceces cece cesses ces seeses cesses c
Numbers on both sides become strange text:
OW
AMNAURWNP,
©M~MURDUNBWNHE
Here's the full PSM 1 text:
MY PASTA RESTAURANT
DISHES
Cheese on Pasta...
Cheesy Spaghetti...
OW AMNAURWNP
SPAghe ttn... .ccccececcseseeceeseeceeceesoesee ses sessee ses eesseseesaesaesees eeeesees es
SAUCE ON PASTA... cccceces cece cesses ces seeses cesses ces seesescaeseeeeseuueeces anes
Mega value cheese o on some Spicy Sauce... esate eeeees
Fresh and Tasty handmade assortments (FATS) cscscsocne
ANTIP ASTI... eee cee cee cee coe coeese ses couse ses cesses aes cesses caeses ses caeeaescaeeees
FreSh SIAW......ccescsseecee cesses coecas cesses see see cusses sue ses aecaecas ces case eenaee sees
NOOC1@S... 20. .s. cesses co cesses see cuecoe ces cusses cou cesses ace sue seseas cuecaeens ease senses es
©M~MURDUNBWNHE
tesseract -l eng --psm 12 is better:
- No random capitalisation, apart from numbers on both sides turning into capitalised words.
- 3 of 9 lines have a single ellipsis, the rest have no dots.
Here's the full PSM 12 text:
MY PASTA RESTAURANT
DISHES
Spaghetti
Sauce on Pasta
Cheese on Pasta...
Cheesy Spaghetti...
Mega value cheese o on some Spicy Sauce...
Fresh and Tasty handmade assortments (FATS)
Antipasti
Fresh slaw
WON DUN BWHN PR
Noodles
OMAN DY BWN PR
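To compare page segmentation modes side by side, the two runs above can be scripted. This is a minimal sketch, not part of the original report: the helper names and the file name menu.png are hypothetical, and it assumes the tesseract binary is on PATH. It passes stdout as the output base so the recognised text is printed rather than written to a file.

```python
import subprocess


def tesseract_command(image_path, psm, lang="eng"):
    """Build the argument list for one Tesseract run.

    "stdout" as the output base makes Tesseract print the text
    instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_with_psm(image_path, psm, lang="eng"):
    """Run Tesseract once and return the recognised text.

    Raises FileNotFoundError if tesseract is not installed and
    CalledProcessError if the run fails.
    """
    completed = subprocess.run(
        tesseract_command(image_path, psm, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


# Example usage (menu.png is a placeholder for the image in this report):
# for psm in (1, 12):
#     print(f"--- psm {psm} ---")
#     print(ocr_with_psm("menu.png", psm))
```

Looping over more PSM values is an easy way to check whether any mode handles the dot leaders better on a given image.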
Expected Behavior:
Dots should be OCR'd as dots, and the text output should match the text in the input image.
I tried https://cloud.google.com/vision/docs/ocr on this with DOCUMENT_TEXT_DETECTION, giving:
"text": "MY PASTA RESTAURANT\nDISHES\n1. Spaghetti...........\n..........1\n2. Sauce on Pasta.........\n...........2\n3. Cheese on Pasta..........\n3\n4. Cheesy Spaghetti.............\n........4\n5. Mega value cheese on some Spicy Sauce....\n.5\n6. Fresh and Tasty handmade assortments (FATS)................6\n7. Antipasti.............\n..........7\n8. Fresh slaw.........\n8\n9. Noodles...........\n..9\n"
It looks generally very good, although there's a slight hiccup at:
7. Antipasti.............\n..........7\n8
and
5. Mega value cheese on some Spicy Sauce....\n.5\n6.
(the new line shouldn't be there before the second 5)
... and it is a totally different tool (and a paid one).
Yep, just giving a better example of expected behaviour.
@IdiosApps have you resolved this issue?
Nope
Is there any image preprocessing that solves this issue? I have already tried some of the available options.
Has anyone found a workaround to fix this?
https://github.com/tesseract-ocr/tesseract/issues/3748#issuecomment-1032595980
See another example here: #4126
I created an OpenCV preprocessing step that seems to remove leader dots fairly accurately and improve the OCR output. More testing is needed, but in case it's useful for someone, here's the code:
https://colab.research.google.com/drive/18oe5tlZa2yXYCRfILN39IgYT0nfSgZQA#scrollTo=9dK6DxFX7EQ4&line=1&uniqifier=1
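The notebook's code is not reproduced in this thread, so the following is not that implementation. It is only a dependency-free sketch of one plausible approach to the same idea: treat each leader dot as a tiny connected blob of ink and erase every blob below an area threshold before running OCR. The function name, the 0/1 list-of-rows image format, and the max_area default are all assumptions made for illustration. In OpenCV the labelling step would typically be done with cv2.connectedComponentsWithStats on a thresholded image.

```python
from collections import deque


def remove_small_blobs(image, max_area=4):
    """Return a copy of a binary image with small ink blobs erased.

    image is a list of rows containing 0 (background) or 1 (ink).
    Any 4-connected blob of at most max_area ink pixels is set to 0.
    The input is not modified.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill collects one connected blob.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Dot-sized blobs are erased; larger blobs (letters) are kept.
            if len(blob) <= max_area:
                for by, bx in blob:
                    result[by][bx] = 0
    return result
```

A plain area threshold also erases full stops, the dots of "i" and "j", and the dot in list numbers such as "1.", so a real preprocessing step would need an extra check, for example only removing dots that occur in long horizontal runs.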