Tesseract inserts extra blank lines

Open nezda opened this issue 6 years ago • 19 comments

Environment

  • Tesseract Version: 4.0.0 ~~4.0.0-beta.1 from https://packages.debian.org/stretch-backports/tesseract-ocr~~
  • Commit Number: 51316994ccae0b48692d547030f26c0969308214 ~~c3ed6f036064e54e34f75275f66c70dd924527bf~~
  • Platform: Linux my-machine 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux

Current Behavior:

Tesseract is adding extra blank lines to output. The input has no skew, odd fonts or formatting.

The test input is this simple page, produced from a PDF using pdftocairo -r 300 -tiffcompression lzw -tiff simple_page.pdf (the original content comes from a publicly available document): simple_page.tif.gz

Here's a screenshot of the above tiff to make the issue easier to quickly understand.

[Image: screenshot of simple_page.tif]

Output of tesseract -l eng simple_page.tif simple_page contains 12 blank lines when it should contain 3.

J.P. B. Vogel, Esq.

[email protected]

COATS, ROSE, YALE, RYMAN & LEE, P.C.
Two Lincoln Centre

5420 LBJ Freeway, Suite 600

Dallas, Texas 75240

Richard A, Fulton, Esq.
[email protected]

COATS, ROSE, YALE, RYMAN & LEE, P.C.
9 Greenway Plaza, Suite 1100

Houston, Texas 77046

Counsel for Plaintiff United States of America
For The Use and Benefit of EJ Smith Construction
Company, LLC

Keith A, Langley, Esq.
[email protected]
LANGLEY LLP

901 Main Street

Suite 600

Dallas, Texas 75202

Expected Behavior:

Output the same text except the extra blank lines.

J.P. B. Vogel, Esq.
[email protected]
COATS, ROSE, YALE, RYMAN & LEE, P.C.
Two Lincoln Centre
5420 LBJ Freeway, Suite 600
Dallas, Texas 75240

Richard A, Fulton, Esq.
[email protected]
COATS, ROSE, YALE, RYMAN & LEE, P.C.
9 Greenway Plaza, Suite 1100
Houston, Texas 77046

Counsel for Plaintiff United States of America
For The Use and Benefit of EJ Smith Construction
Company, LLC

Keith A, Langley, Esq.
[email protected]
LANGLEY LLP
901 Main Street
Suite 600
Dallas, Texas 75202

Suggested Fix:

Unknown

nezda avatar Jan 09 '19 23:01 nezda

4.0.0-beta.1 is an old version. Please use the latest code when reporting an issue.

zdenop avatar Jan 11 '19 20:01 zdenop

@zdenop I downloaded and built the official 4.0.0 release and re-ran my test, with identical results:

nezda@my-machine:~/tesseract-4.0.0$ tesseract --version
tesseract 4.0.0
 leptonica-1.74.1
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8 : libwebp 0.5.2 : libopenjp2 2.1.2

 Found AVX2
 Found AVX
 Found SSE

nezda avatar Jan 11 '19 22:01 nezda

tesseract 4.0.0-333-gb3bd
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0

I can confirm this issue with the latest code as well, using all three traineddata variants.

Shreeshrii avatar Feb 21 '19 15:02 Shreeshrii

Check the hOCR output.

How many 'div' tags (blocks) are there? How many 'p' tags (paragraphs)?

amitdo avatar Feb 21 '19 16:02 amitdo

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 4.1.0-rc1-9-g49ed' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "2155.png"; bbox 0 0 460 615; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 22 29 382 169">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 22 29 173 46">
     <span class='ocr_line' id='line_1_1' title="bbox 22 29 173 46; baseline 0 -4; x_size 17; x_descenders 4; x_ascenders 3">
      <span class='ocrx_word' id='word_1_1' title='bbox 22 29 47 42; x_wconf 93'>J.P.</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 60 29 74 42; x_wconf 92'>B.</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 81 29 131 46; x_wconf 95'>Vogel,</span>
      <span class='ocrx_word' id='word_1_4' title='bbox 139 29 173 46; x_wconf 93'>Esq.</span>
     </span>
    </p>

    <p class='ocr_par' id='par_1_2' lang='eng' title="bbox 23 53 216 71">
     <span class='ocr_line' id='line_1_2' title="bbox 23 53 216 71; baseline 0 -4; x_size 21.5; x_descenders 5.5; x_ascenders 5.5">
      <span class='ocrx_word' id='word_1_5' title='bbox 23 53 216 71; x_wconf 90'>[email protected]</span>
     </span>
    </p>

    <p class='ocr_par' id='par_1_3' lang='eng' title="bbox 22 79 382 117">
     <span class='ocr_line' id='line_1_3' title="bbox 22 79 382 94; baseline 0 -2; x_size 17.46875; x_descenders 4.4687505; x_ascenders 4.46875">
      <span class='ocrx_word' id='word_1_6' title='bbox 22 79 88 94; x_wconf 95'>COATS,</span>
      <span class='ocrx_word' id='word_1_7' title='bbox 96 79 150 94; x_wconf 96'>ROSE,</span>
      <span class='ocrx_word' id='word_1_8' title='bbox 158 79 208 94; x_wconf 96'>YALE,</span>
      <span class='ocrx_word' id='word_1_9' title='bbox 216 79 279 92; x_wconf 93'>RYMAN</span>
      <span class='ocrx_word' id='word_1_10' title='bbox 287 79 297 92; x_wconf 93'>&amp;</span>
      <span class='ocrx_word' id='word_1_11' title='bbox 304 79 341 94; x_wconf 92'>LEE,</span>
      <span class='ocrx_word' id='word_1_12' title='bbox 349 79 382 92; x_wconf 92'>P.C.</span>
     </span>
     <span class='ocr_line' id='line_1_4' title="bbox 22 104 179 117; baseline 0 0; x_size 18.238094; x_descenders 5.2380953; x_ascenders 3">
      <span class='ocrx_word' id='word_1_13' title='bbox 22 104 55 117; x_wconf 96'>Two</span>
      <span class='ocrx_word' id='word_1_14' title='bbox 62 104 118 117; x_wconf 95'>Lincoln</span>
      <span class='ocrx_word' id='word_1_15' title='bbox 125 104 179 117; x_wconf 94'>Centre</span>
     </span>
    </p>

    <p class='ocr_par' id='par_1_4' lang='eng' title="bbox 22 129 261 146">
     <span class='ocr_line' id='line_1_5' title="bbox 22 129 261 146; baseline 0 -4; x_size 17.46875; x_descenders 4.4687505; x_ascenders 4.46875">
      <span class='ocrx_word' id='word_1_16' title='bbox 22 129 61 142; x_wconf 93'>5420</span>
      <span class='ocrx_word' id='word_1_17' title='bbox 69 129 98 142; x_wconf 90'>LBJ</span>
      <span class='ocrx_word' id='word_1_18' title='bbox 106 129 178 146; x_wconf 95'>Freeway,</span>
      <span class='ocrx_word' id='word_1_19' title='bbox 186 129 226 142; x_wconf 96'>Suite</span>
      <span class='ocrx_word' id='word_1_20' title='bbox 232 129 261 142; x_wconf 96'>600</span>
     </span>
    </p>

    <p class='ocr_par' id='par_1_5' lang='eng' title="bbox 23 154 188 169">
     <span class='ocr_line' id='line_1_6' title="bbox 23 154 188 169; baseline 0 -2; x_size 18.238094; x_descenders 5.2380953; x_ascenders 3">
      <span class='ocrx_word' id='word_1_21' title='bbox 23 154 76 169; x_wconf 95'>Dallas,</span>
      <span class='ocrx_word' id='word_1_22' title='bbox 83 154 132 167; x_wconf 96'>Texas</span>
      <span class='ocrx_word' id='word_1_23' title='bbox 139 154 188 167; x_wconf 96'>75240</span>
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_2' title="bbox 22 204 378 320">
    <p class='ocr_par' id='par_1_6' lang='eng' title="bbox 23 204 209 246">
     <span class='ocr_line' id='line_1_7' title="bbox 23 204 204 221; baseline 0 -4; x_size 17.333332; x_descenders 4.333333; x_ascenders 4.3333335">
      <span class='ocrx_word' id='word_1_24' title='bbox 23 204 83 217; x_wconf 96'>Richard</span>
      <span class='ocrx_word' id='word_1_25' title='bbox 90 204 105 219; x_wconf 96'>A,</span>
      <span class='ocrx_word' id='word_1_26' title='bbox 114 204 167 220; x_wconf 96'>Fulton,</span>
      <span class='ocrx_word' id='word_1_27' title='bbox 175 204 204 221; x_wconf 87'>Esq.</span>
     </span>
     <span class='ocr_line' id='line_1_8' title="bbox 23 229 209 246; baseline 0 -3; x_size 20; x_descenders 5; x_ascenders 5">
      <span class='ocrx_word' id='word_1_28' title='bbox 23 229 209 246; x_wconf 90'>[email protected]</span>
     </span>
    </p>

    <p class='ocr_par' id='par_1_7' lang='eng' title="bbox 22 254 378 297">
     <span class='ocr_line' id='line_1_9' title="bbox 22 254 378 269; baseline -0.003 -1; x_size 17.333332; x_descenders 4.333333; x_ascenders 4.3333335">
      <span class='ocrx_word' id='word_1_29' title='bbox 22 254 88 269; x_wconf 96'>COATS,</span>
      <span class='ocrx_word' id='word_1_30' title='bbox 96 254 150 269; x_wconf 96'>ROSE,</span>
      <span class='ocrx_word' id='word_1_31' title='bbox 158 254 208 268; x_wconf 95'>YALE,</span>
      <span class='ocrx_word' id='word_1_32' title='bbox 216 254 279 267; x_wconf 92'>RYMAN</span>
      <span class='ocrx_word' id='word_1_33' title='bbox 287 254 297 268; x_wconf 92'>&amp;</span>
      <span class='ocrx_word' id='word_1_34' title='bbox 304 254 341 269; x_wconf 91'>LEE,</span>
      <span class='ocrx_word' id='word_1_35' title='bbox 349 254 378 268; x_wconf 74'>P.C.</span>
     </span>
     <span class='ocr_line' id='line_1_10' title="bbox 22 279 268 297; baseline 0 -4; x_size 17.333332; x_descenders 4.333333; x_ascenders 4.3333335">
      <span class='ocrx_word' id='word_1_36' title='bbox 22 280 31 293; x_wconf 96'>9</span>
      <span class='ocrx_word' id='word_1_37' title='bbox 38 279 120 297; x_wconf 96'>Greenway</span>
      <span class='ocrx_word' id='word_1_38' title='bbox 127 280 175 295; x_wconf 96'>Plaza,</span>
      <span class='ocrx_word' id='word_1_39' title='bbox 183 279 223 293; x_wconf 96'>Suite</span>
      <span class='ocrx_word' id='word_1_40' title='bbox 231 280 268 293; x_wconf 96'>1100</span>
     </span>
    </p>

    <p class='ocr_par' id='par_1_8' lang='eng' title="bbox 23 304 205 320">
     <span class='ocr_line' id='line_1_11' title="bbox 23 304 205 320; baseline -0.005 -2; x_size 17.333332; x_descenders 4.333333; x_ascenders 4.3333335">
      <span class='ocrx_word' id='word_1_41' title='bbox 23 304 93 320; x_wconf 96'>Houston,</span>
      <span class='ocrx_word' id='word_1_42' title='bbox 101 304 149 318; x_wconf 96'>Texas</span>
      <span class='ocrx_word' id='word_1_43' title='bbox 156 304 205 318; x_wconf 96'>77046</span>
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_3' title="bbox 22 354 427 422">
    <p class='ocr_par' id='par_1_9' lang='eng' title="bbox 22 354 427 422">
     <span class='ocr_line' id='line_1_12' title="bbox 22 354 388 368; baseline 0 0; x_size 17.46875; x_descenders 4.4687505; x_ascenders 4.46875">
      <span class='ocrx_word' id='word_1_44' title='bbox 22 354 87 368; x_wconf 95'>Counsel</span>
      <span class='ocrx_word' id='word_1_45' title='bbox 94 354 115 368; x_wconf 96'>for</span>
      <span class='ocrx_word' id='word_1_46' title='bbox 122 354 181 368; x_wconf 89'>Plaintiff</span>
      <span class='ocrx_word' id='word_1_47' title='bbox 187 355 238 368; x_wconf 96'>United</span>
      <span class='ocrx_word' id='word_1_48' title='bbox 245 354 295 368; x_wconf 96'>States</span>
      <span class='ocrx_word' id='word_1_49' title='bbox 302 354 317 368; x_wconf 96'>of</span>
      <span class='ocrx_word' id='word_1_50' title='bbox 322 355 388 368; x_wconf 96'>America</span>
     </span>
     <span class='ocr_line' id='line_1_13' title="bbox 23 380 427 393; baseline 0 0; x_size 18.238094; x_descenders 5.2380953; x_ascenders 3">
      <span class='ocrx_word' id='word_1_51' title='bbox 23 380 49 393; x_wconf 95'>For</span>
      <span class='ocrx_word' id='word_1_52' title='bbox 54 380 85 393; x_wconf 95'>The</span>
      <span class='ocrx_word' id='word_1_53' title='bbox 92 380 123 393; x_wconf 96'>Use</span>
      <span class='ocrx_word' id='word_1_54' title='bbox 129 380 158 393; x_wconf 96'>and</span>
      <span class='ocrx_word' id='word_1_55' title='bbox 166 380 221 393; x_wconf 95'>Benefit</span>
      <span class='ocrx_word' id='word_1_56' title='bbox 227 380 242 393; x_wconf 96'>of</span>
      <span class='ocrx_word' id='word_1_57' title='bbox 249 380 267 393; x_wconf 96'>EJ</span>
      <span class='ocrx_word' id='word_1_58' title='bbox 275 380 319 393; x_wconf 96'>Smith</span>
      <span class='ocrx_word' id='word_1_59' title='bbox 327 380 427 393; x_wconf 96'>Construction</span>
     </span>
     <span class='ocr_line' id='line_1_14' title="bbox 22 404 143 422; baseline 0 -4; x_size 17; x_descenders 3; x_ascenders 4">
      <span class='ocrx_word' id='word_1_60' title='bbox 22 404 103 422; x_wconf 94'>Company,</span>
      <span class='ocrx_word' id='word_1_61' title='bbox 111 404 143 418; x_wconf 95'>LLC</span>
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_4' title="bbox 22 455 195 596">
    <p class='ocr_par' id='par_1_10' lang='eng' title="bbox 23 455 195 518">
     <span class='ocr_line' id='line_1_15' title="bbox 23 455 195 472; baseline 0 -4; x_size 17; x_descenders 4; x_ascenders 3">
      <span class='ocrx_word' id='word_1_62' title='bbox 23 455 62 468; x_wconf 96'>Keith</span>
      <span class='ocrx_word' id='word_1_63' title='bbox 69 455 84 470; x_wconf 91'>A,</span>
      <span class='ocrx_word' id='word_1_64' title='bbox 92 455 159 472; x_wconf 96'>Langley,</span>
      <span class='ocrx_word' id='word_1_65' title='bbox 167 455 195 471; x_wconf 96'>Esq</span>
     </span>
     <span class='ocr_line' id='line_1_16' title="bbox 23 480 175 497; baseline 0 -4; x_size 17; x_descenders 4; x_ascenders 3">
      <span class='ocrx_word' id='word_1_66' title='bbox 23 480 175 497; x_wconf 45'>[email protected]</span>
     </span>
     <span class='ocr_line' id='line_1_17' title="bbox 23 505 143 518; baseline 0 0; x_size 17.46875; x_descenders 4.4687505; x_ascenders 4.46875">
      <span class='ocrx_word' id='word_1_67' title='bbox 23 505 106 518; x_wconf 96'>LANGLEY</span>
      <span class='ocrx_word' id='word_1_68' title='bbox 113 505 143 518; x_wconf 95'>LLP</span>
     </span>
    </p>

    <p class='ocr_par' id='par_1_11' lang='eng' title="bbox 22 530 151 568">
     <span class='ocr_line' id='line_1_18' title="bbox 22 530 151 544; baseline -0.008 0; x_size 18.8125; x_descenders 4.8125005; x_ascenders 4.8125">
      <span class='ocrx_word' id='word_1_69' title='bbox 22 530 49 544; x_wconf 96'>901</span>
      <span class='ocrx_word' id='word_1_70' title='bbox 58 530 96 544; x_wconf 96'>Main</span>
      <span class='ocrx_word' id='word_1_71' title='bbox 103 530 151 544; x_wconf 96'>Street</span>
     </span>
     <span class='ocr_line' id='line_1_19' title="bbox 22 555 98 568; baseline 0 0; x_size 17.46875; x_descenders 4.4687505; x_ascenders 4.46875">
      <span class='ocrx_word' id='word_1_72' title='bbox 22 555 62 568; x_wconf 96'>Suite</span>
      <span class='ocrx_word' id='word_1_73' title='bbox 69 555 98 568; x_wconf 96'>600</span>
     </span>
    </p>

    <p class='ocr_par' id='par_1_12' lang='eng' title="bbox 23 580 188 596">
     <span class='ocr_line' id='line_1_20' title="bbox 23 580 188 596; baseline 0.006 -3; x_size 18.8125; x_descenders 4.8125005; x_ascenders 4.8125">
      <span class='ocrx_word' id='word_1_74' title='bbox 23 580 76 596; x_wconf 95'>Dallas,</span>
      <span class='ocrx_word' id='word_1_75' title='bbox 83 580 132 594; x_wconf 96'>Texas</span>
      <span class='ocrx_word' id='word_1_76' title='bbox 139 580 188 594; x_wconf 96'>75202</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

Shreeshrii avatar Feb 22 '19 11:02 Shreeshrii

The block detection is right.

The paragraph detection is wrong.
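
For anyone who wants to repeat this check on their own hOCR file, here is a minimal Python sketch that counts elements by hOCR class. It is only an illustration and not part of Tesseract: the function name count_hocr_classes and the trimmed sample markup are made up for the example. In the hOCR posted above, the ids run from block_1_1 to block_1_4 and from par_1_1 to par_1_12, so 4 blocks were split into 12 paragraphs.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed, hypothetical hOCR fragment: one block holding two paragraphs.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class='ocr_page' id='page_1'>
   <div class='ocr_carea' id='block_1_1'>
    <p class='ocr_par' id='par_1_1'>
     <span class='ocr_line' id='line_1_1'><span class='ocrx_word' id='word_1_1'>J.P.</span></span>
    </p>
    <p class='ocr_par' id='par_1_2'>
     <span class='ocr_line' id='line_1_2'><span class='ocrx_word' id='word_1_2'>Two</span></span>
    </p>
   </div>
  </div>
 </body>
</html>"""


def count_hocr_classes(hocr: str) -> Counter:
    """Count elements per hOCR class (ocr_carea, ocr_par, ocr_line, ...)."""
    # Parse from bytes so an XML declaration with an encoding is accepted too.
    root = ET.fromstring(hocr.encode("utf-8"))
    counts = Counter()
    for element in root.iter():
        # The class attribute may hold several space-separated names.
        for cls in element.get("class", "").split():
            counts[cls] += 1
    return counts


counts = count_hocr_classes(SAMPLE_HOCR)
print(counts["ocr_carea"], counts["ocr_par"], counts["ocr_line"])  # prints: 1 2 2
```

Matching on the class attribute rather than on the tag name avoids having to deal with the XHTML namespace that the parser attaches to every tag.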

amitdo avatar Feb 22 '19 11:02 amitdo

https://github.com/tesseract-ocr/tesseract/blob/272ebf995f99d9c926ce0c951836f3fd1db90a87/src/ccmain/paragraphs.cpp

amitdo avatar Feb 22 '19 12:02 amitdo

A similar issue is reported at https://github.com/tesseract-ocr/tesseract/issues/2179

Shreeshrii avatar Mar 11 '19 18:03 Shreeshrii

Hello! I am still getting this error in

tesseract v5.0.0-alpha.20200328
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
 Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0

Anything I can do?

Rui Fontes

ruifontes avatar Apr 30 '20 21:04 ruifontes

Try setting the parameter paragraph_text_based to false.
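
One way to pass the parameter without a separate config file is the command line's -c NAME=VALUE option. The Python sketch below only assembles such a command; the helper name build_tesseract_command and the file name simple_page.tif are assumptions for illustration, and OCR is run only if both the tesseract binary and the image are actually present.

```python
import os
import shutil
import subprocess


def build_tesseract_command(image, output_base="-", lang="eng", variables=None):
    """Build a tesseract argv list, passing each variable as `-c NAME=VALUE`."""
    command = ["tesseract", image, output_base, "-l", lang]
    for name, value in (variables or {}).items():
        command += ["-c", f"{name}={value}"]
    return command


command = build_tesseract_command(
    "simple_page.tif", variables={"paragraph_text_based": "false"}
)
print(" ".join(command))
# prints: tesseract simple_page.tif - -l eng -c paragraph_text_based=false

# Run OCR only when the binary and the (assumed) input image both exist.
if shutil.which("tesseract") and os.path.exists("simple_page.tif"):
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    print(result.stdout)
```

An output base of - sends the recognized text to stdout, so the result can be inspected directly.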

amitdo avatar Apr 30 '20 23:04 amitdo

Nice one, @amitdo!

nezda@lukes-machina:Downloads$ tesseract --version
tesseract 4.1.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.0.3 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE

nezda@lukes-machina:Downloads$ cat tesseract_confs
paragraph_text_based false

nezda@lukes-machina:Downloads$ tesseract -l eng simple_page.tif - tesseract_confs
Page 1
J.P. B. Vogel, Esq.

[email protected]
COATS, ROSE, YALE, RYMAN & LEE, P.C.
Two Lincoln Centre
5420 LBJ Freeway, Suite 600
Dallas, Texas 75240

Richard A, Fulton, Esq.
[email protected]
COATS, ROSE, YALE, RYMAN & LEE, P.C.
9 Greenway Plaza, Suite 1100
Houston, Texas 77046

Counsel for Plaintiff United States of America
For The Use and Benefit of EJ Smith Construction
Company, LLC

Keith A, Langley, Esq.
[email protected]
LANGLEY LLP
901 Main Street
Suite 600
Dallas, Texas 75202

nezda avatar May 01 '20 12:05 nezda

@amitdo should paragraph_text_based be disabled by default?

nezda avatar May 01 '20 13:05 nezda

> should paragraph_text_based be disabled by default

Do you mean for all users of Tesseract, or just for you and your input images?

Such a change in default value (for all users) needs massive testing on thousands of images with diverse languages and sources.

The layout analysis algorithm of Tesseract was designed to deal with the layouts of books and magazines.

amitdo avatar May 01 '20 13:05 amitdo

Sounds like the automated testing around that could use some beefing up. I'd help if you gave some pointers.

nezda avatar May 01 '20 14:05 nezda

Ray Smith from Google did extensive testing in the past. The testing images were not made public. Currently, Ray is not active in this project.

amitdo avatar May 01 '20 16:05 amitdo

I see, @amitdo. Well, thank you for resolving this for us, and ping me (@nezda) if you want help automating related testing at some point.

nezda avatar May 01 '20 17:05 nezda

The currently available unit tests are in https://github.com/tesseract-ocr/tesseract/tree/master/unittest

You are most welcome to contribute additional ones.
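
As a starting point for such a check, independent of the project's own test suite, a regression test could compare the number of blank lines in the OCR output against the expected count. The Python sketch below is only an illustration: count_blank_lines is a hypothetical helper, and the two strings are shortened excerpts of the observed and expected output from this issue.

```python
def count_blank_lines(text: str) -> int:
    """Count empty or whitespace-only lines, ignoring leading/trailing newlines."""
    return sum(1 for line in text.strip("\n").split("\n") if not line.strip())


# Shortened excerpts from the issue report (observed vs. expected output).
observed = (
    "J.P. B. Vogel, Esq.\n\n"
    "Two Lincoln Centre\n\n"
    "5420 LBJ Freeway, Suite 600\n\n"
    "Dallas, Texas 75240\n\n"
    "Richard A, Fulton, Esq.\n"
)
expected = (
    "J.P. B. Vogel, Esq.\n"
    "Two Lincoln Centre\n"
    "5420 LBJ Freeway, Suite 600\n"
    "Dallas, Texas 75240\n\n"
    "Richard A, Fulton, Esq.\n"
)

print(count_blank_lines(observed), count_blank_lines(expected))  # prints: 4 1
```

A real test would obtain the observed text by running Tesseract on a fixed image and would fail when the two counts differ.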

Shreeshrii avatar May 01 '20 17:05 Shreeshrii

Setting paragraph_text_based=false only causes Tesseract to skip some steps in its paragraph detection phase.

Currently, there is no way (even using the API) to completely disable paragraph detection.
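
Since block detection was reported as correct for this page, one possible workaround is to rebuild the plain text from the hOCR output and insert a blank line only between blocks (ocr_carea), ignoring paragraph boundaries (ocr_par). The following Python sketch is post-processing outside Tesseract, not a Tesseract feature; hocr_to_text_by_block and the sample markup are hypothetical, and line-level classes other than ocr_line, if present, are ignored.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical hOCR fragment: two blocks, the first split into two paragraphs.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <body>
  <div class='ocr_page' id='page_1'>
   <div class='ocr_carea' id='block_1_1'>
    <p class='ocr_par' id='par_1_1'>
     <span class='ocr_line' id='line_1_1'>
      <span class='ocrx_word' id='word_1_1'>J.P.</span>
      <span class='ocrx_word' id='word_1_2'>Vogel,</span>
     </span>
    </p>
    <p class='ocr_par' id='par_1_2'>
     <span class='ocr_line' id='line_1_2'>
      <span class='ocrx_word' id='word_1_3'>Two</span>
      <span class='ocrx_word' id='word_1_4'>Lincoln</span>
     </span>
    </p>
   </div>
   <div class='ocr_carea' id='block_1_2'>
    <p class='ocr_par' id='par_1_3'>
     <span class='ocr_line' id='line_1_3'>
      <span class='ocrx_word' id='word_1_5'>Richard</span>
      <span class='ocrx_word' id='word_1_6'>A,</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>"""


def _has_class(element, name):
    """True if the element's class attribute contains the given class name."""
    return name in element.get("class", "").split()


def hocr_to_text_by_block(hocr: str) -> str:
    """Emit one text line per ocr_line, with a blank line only between blocks."""
    root = ET.fromstring(hocr.encode("utf-8"))
    blocks = []
    for block in root.iter():
        if not _has_class(block, "ocr_carea"):
            continue
        lines = []
        for line in block.iter():
            if not _has_class(line, "ocr_line"):
                continue
            words = [
                "".join(word.itertext()).strip()
                for word in line.iter()
                if _has_class(word, "ocrx_word")
            ]
            lines.append(" ".join(w for w in words if w))
        if lines:
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


text = hocr_to_text_by_block(SAMPLE_HOCR)
print(text, end="")
```

For the sample, the two paragraphs of the first block come out on consecutive lines, and the only blank line separates the first block from the second.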

amitdo avatar May 12 '20 22:05 amitdo

Hello, I'm using this version of Tesseract (the latest, I guess):

tesseract v5.0.0-alpha.2020032
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
 Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0

and I have the same issue. Setting paragraph_text_based to false does not seem to help, though. What might be the issue here? Thanks

br4in1 avatar Jul 21 '20 08:07 br4in1