ocrd_tesserocr

Segmentation on raw images

Open bertsky opened this issue 5 years ago • 3 comments

bertsky avatar Aug 24 '20 18:08 bertsky

Codecov Report

Merging #144 into master will increase coverage by 0.04%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #144      +/-   ##
==========================================
+ Coverage   37.73%   37.77%   +0.04%     
==========================================
  Files           9        9              
  Lines        1023      998      -25     
  Branches      216      212       -4     
==========================================
- Hits          386      377       -9     
+ Misses        565      555      -10     
+ Partials       72       66       -6     

| Impacted Files | Coverage Δ | |
|---|---|---|
| ocrd_tesserocr/crop.py | 13.51% <ø> (+0.78%) | :arrow_up: |
| ocrd_tesserocr/segment_line.py | 63.63% <ø> (-8.68%) | :arrow_down: |
| ocrd_tesserocr/segment_region.py | 53.64% <ø> (+4.21%) | :arrow_up: |
| ocrd_tesserocr/segment_table.py | 0.00% <0.00%> (ø) | |
| ocrd_tesserocr/recognize.py | 47.75% <0.00%> (-1.00%) | :arrow_down: |
| ocrd_tesserocr/binarize.py | 22.95% <0.00%> (+1.63%) | :arrow_up: |
| ocrd_tesserocr/deskew.py | 17.34% <0.00%> (+1.88%) | :arrow_up: |
| ... and 2 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 24b7ced...2b3e8d6.

codecov[bot] avatar Aug 24 '20 18:08 codecov[bot]

This needs to be tested systematically. I expect to see both degradation and improvement, depending on how hard binarization is. See here for explanation.

bertsky avatar Aug 24 '20 18:08 bertsky

> or perhaps should be parameterizable.

I thought about that, but at workflow configuration time, you have next to no chance of knowing which is going to be better. (I would guess that only input images which fare well under global Otsu are better off with the change. But we have no automatic indicator of binarization quality yet. At the very least, we should strive for some estimator based on the local distribution of connected component statistics.)
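To make the connected-component idea a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the function names `connected_components` and `speckle_ratio_per_tile`, the tile size, and the `min_size` cutoff are all hypothetical and not part of ocrd_tesserocr or Tesseract. The heuristic it uses (the share of very small components per tile as a sign of speckle noise) is just one possible statistic, not a validated quality measure.

```python
from collections import deque


def connected_components(binary):
    """Find 8-connected foreground (truthy) components in a 2D list.

    Returns a list of (pixel_count, (row, col)) tuples, where (row, col)
    is the first pixel of the component in scan order.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            count = 0
            while queue:
                y, x = queue.popleft()
                count += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append((count, (r, c)))
    return components


def speckle_ratio_per_tile(binary, tile=32, min_size=4):
    """Hypothetical heuristic: per tile, the fraction of components with
    fewer than min_size pixels. Components are assigned to the tile that
    contains their first pixel in scan order.
    """
    totals = {}
    small = {}
    for count, (r, c) in connected_components(binary):
        key = (r // tile, c // tile)
        totals[key] = totals.get(key, 0) + 1
        if count < min_size:
            small[key] = small.get(key, 0) + 1
    return {key: small.get(key, 0) / totals[key] for key in totals}
```

Tiles where the ratio is close to 1 contain mostly tiny blobs, which might hint at noisy binarization in that area. Whether such a number correlates with segmentation quality would still have to be tested on real data.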

But I still hope that we can fix the problem in Tesseract itself.

bertsky avatar Aug 25 '20 10:08 bertsky