Inaccurate OCR results for lines with many dots

Open james-s-w-clark opened this issue 4 years ago • 9 comments

Environment

  • Tesseract Version: 4.1.1
  • Platform: MacOS Catalina 10.15.4

Current Behavior:

A few issues with tesseract -l eng --psm 1 on this image: ocr_input

  • Some dots on lines are ignored: Cheese on Pasta...
  • Some dots on lines become strange letters, and the text is incorrectly capitalised: SAUCE ON PASTA... cccceces cece cesses ces seeses cesses c
  • Numbers on both sides become strange text: OW AMNAURWNP, ©M~MURDUNBWNHE

Here's the full PSM 1 text:

MY PASTA RESTAURANT

DISHES

Cheese on Pasta...
Cheesy Spaghetti...

OW AMNAURWNP

SPAghe ttn... .ccccececcseseeceeseeceeceesoesee ses sessee ses eesseseesaesaesees eeeesees es
SAUCE ON PASTA... cccceces cece cesses ces seeses cesses ces seesescaeseeeeseuueeces anes

Mega value cheese o on some Spicy Sauce... esate eeeees
Fresh and Tasty handmade assortments (FATS) cscscsocne
ANTIP ASTI... eee cee cee cee coe coeese ses couse ses cesses aes cesses caeses ses caeeaescaeeees
FreSh SIAW......ccescsseecee cesses coecas cesses see see cusses sue ses aecaecas ces case eenaee sees
NOOC1@S... 20. .s. cesses co cesses see cuecoe ces cusses cou cesses ace sue seseas cuecaeens ease senses es

©M~MURDUNBWNHE

tesseract -l eng --psm 12 is better:

  • no random capitalisation, apart from numbers on both sides turning into capitalised words.

  • 3 of 9 lines have a single ellipsis, the rest have no dots.

Here's the full PSM 12 text:

MY PASTA RESTAURANT
DISHES
Spaghetti
Sauce on Pasta
Cheese on Pasta...
Cheesy Spaghetti...
Mega value cheese o on some Spicy Sauce...
Fresh and Tasty handmade assortments (FATS)
Antipasti
Fresh slaw
WON DUN BWHN PR
Noodles
OMAN DY BWN PR
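
For anyone wanting to reproduce the comparison, the two runs could be scripted roughly as follows. This is only a sketch: the helper names `build_tesseract_cmd` and `run_tesseract` are made up for illustration, and the image file name passed in is whatever the input image was saved as. It relies on the standard `tesseract <image> stdout` form, which prints the recognised text instead of writing an output file.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, psm, lang="eng"):
    """Build the argv for a Tesseract run that prints recognised text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def run_tesseract(image_path, psm, lang="eng"):
    """Run Tesseract and return its text output, or None if the binary is missing."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm, lang),
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Calling `run_tesseract(path, 1)` and `run_tesseract(path, 12)` on the same image and diffing the two strings makes the differences in dots and capitalisation easy to see.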

Expected Behavior:

Dots should be OCR'd as dots, and the text output should look like the text in the input image.

james-s-w-clark avatar May 06 '20 17:05 james-s-w-clark

I tried https://cloud.google.com/vision/docs/ocr on this with DOCUMENT_TEXT_DETECTION, giving:

"text": "MY PASTA RESTAURANT\nDISHES\n1. Spaghetti...........\n..........1\n2. Sauce on Pasta.........\n...........2\n3. Cheese on Pasta..........\n3\n4. Cheesy Spaghetti.............\n........4\n5. Mega value cheese on some Spicy Sauce....\n.5\n6. Fresh and Tasty handmade assortments (FATS)................6\n7. Antipasti.............\n..........7\n8. Fresh slaw.........\n8\n9. Noodles...........\n..9\n"

It looks generally very good, although there are slight hiccups at "7. Antipasti.............\n..........7\n8" and "5. Mega value cheese on some Spicy Sauce....\n.5\n6." (there shouldn't be a new line before the second 5).
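
If stray newlines like these are the only problem, they could probably be patched up in post-processing. A minimal sketch, assuming a continuation line consists only of optional leader dots followed by the page number, and that the line before it ends in a dot (the function name `merge_split_leaders` is made up for illustration):

```python
import re

# A continuation line is only leader dots and/or a trailing page number,
# e.g. "..........1" or "3".
CONTINUATION = re.compile(r"^\.*\d+$")


def merge_split_leaders(text):
    """Re-join leader lines that OCR split across a newline."""
    merged = []
    for line in text.splitlines():
        # Only merge when the previous line ends in a leader dot.
        if merged and merged[-1].endswith(".") and CONTINUATION.match(line):
            merged[-1] += line
        else:
            merged.append(line)
    return "\n".join(merged)


sample = "7. Antipasti.............\n..........7\n8. Fresh slaw.........\n8\n"
print(merge_split_leaders(sample))
```

This only helps when the dots were recognised in the first place, so it does nothing for the Tesseract output above.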

james-s-w-clark avatar May 19 '20 10:05 james-s-w-clark

... and it is a totally different tool (and a paid one).

zdenop avatar May 19 '20 10:05 zdenop

Yep, just giving a better example of expected behaviour.

james-s-w-clark avatar May 19 '20 10:05 james-s-w-clark

@IdiosApps have you resolved this issue?

fazil-imraan avatar Mar 19 '21 07:03 fazil-imraan

@IdiosApps have you resolved this issue?

Nope

james-s-w-clark avatar Mar 30 '21 16:03 james-s-w-clark

Is there any image preprocessing that solves this issue? I have already tried some of the available approaches.

Sustainability4 avatar Jun 29 '21 12:06 Sustainability4

Has anyone found a workaround to fix this?

naourass avatar Aug 18 '22 11:08 naourass

https://github.com/tesseract-ocr/tesseract/issues/3748#issuecomment-1032595980

amitdo avatar Sep 24 '23 08:09 amitdo

See another example here: #4126

I created an OpenCV preprocessing step that seems to remove leader dots fairly accurately and improve the OCR output. More testing is needed, but in case it's useful for someone, here's the code:

https://colab.research.google.com/drive/18oe5tlZa2yXYCRfILN39IgYT0nfSgZQA#scrollTo=9dK6DxFX7EQ4&line=1&uniqifier=1
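
The notebook's code is not reproduced here. As a rough illustration of the general idea only (not necessarily what the notebook does), one common approach is to treat very small connected blobs of ink as leader dots and erase them before OCR. The sketch below is dependency-free and works on a binarised image given as a non-empty list of rows with 1 for ink; the function name `remove_small_blobs` and the `max_area` threshold are assumptions for illustration. With OpenCV, `cv2.connectedComponentsWithStats` provides the same per-blob area information.

```python
from collections import deque


def remove_small_blobs(grid, max_area=4):
    """Return a copy of a binary image (1 = ink) with small ink blobs erased.

    Blobs are 8-connected groups of ink pixels; any blob whose pixel count is
    at most max_area is treated as a leader dot and cleared.
    """
    height, width = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one blob, collecting its pixels.
            blob, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(blob) <= max_area:
                for by, bx in blob:
                    out[by][bx] = 0
    return out
```

A plain area threshold like this would also erase full stops and the dots on "i" and "j", so a real preprocessing step would likely need extra checks, for example only removing dots that sit in a long horizontal run.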

cdrini avatar Sep 25 '23 16:09 cdrini