The OCR results and reported model accuracy mismatch

Open novacellus opened this issue 2 years ago • 10 comments

I have trained a kraken recognition model on 20 PAGE XML files transcribed manually with Transkribus. Up to the evaluation stage everything seems to work just fine, and

    ketos test -m some_best.mlmodel -f page pagexml_trans.xml

reports ca. 99% accuracy:

    === report ===

    4852    Characters
    15      Errors
    99.69%  Accuracy

    4       Insertions
    6       Deletions
    5       Substitutions

    Count   Missed  %Right
    3881    5       99.87%  Latin
    971     4       99.59%  Common

Yet the output of

    kraken -i pagexml_trans.jpg pagexml_trans.txt ocr -m some_best.mlmodel

with the same model applied to the same file is garbage:

    4 qʒ B segce..s. IgouAdI. uad. Ia. ncade

The results are just as bad whether the image is binarized or not.

  1. What am I doing wrong?
  2. Is there any way to inspect the model files in order to reproduce in recognition the parameters used during the train phase?

novacellus avatar Feb 17 '22 23:02 novacellus

Are you segmenting the page? The exact command line you're running should throw an error (Failed processing pagexml_trans.jpg: No line segmentation given. Add one with the input or run `segment` first.), but broken output like that usually indicates that you're either running on the wrong type of image (a grayscale/binarized mix-up) or missing a segmentation.

mittagessen avatar Feb 18 '22 00:02 mittagessen

Yes, I am. In fact, the command should read: kraken -i pagexml_trans.jpg pagexml_trans.txt **segment -bl** ocr -m some_best.mlmodel. I've experimented with kraken-trained segmentation models as well, but to no avail. The image files in both the training and recognition phases are (ImageMagick's identify output): JPEG 4076x2854 4076x2854+0+0 8-bit sRGB 1.17259MiB 0.000u 0:00.000. However, whether the input files are binarized or not, the result is the same. Anyway, thank you for your help!

839309_0026_31120457 839309_0026_31120457.txt (the uploaded TXT file is actually an XML, so the extension needs to be modified)

novacellus avatar Feb 18 '22 01:02 novacellus

Once you get to more senseful output, you also need to take the following into consideration if you are training on pages segmented by Transkribus: the segmentation of kraken is different. You need to repolygonize the pages annotated in Transkribus and then check whether the polygons cover their lines completely, because Transkribus frequently puts the baseline too low. Where the polygon failed for that reason, correct the line, and then train on these repolygonized pages.

dstoekl avatar Feb 18 '22 08:02 dstoekl

I've tried that one, too. Running both train --repolygonize and contrib/repolygonize.py on the XML I've attached before yields this error:

  File "repolygonize.py", line 63, in _repl_page
    pol.attrib['points'] = ' '.join([','.join([str(x) for x in pt]) for pt in o[idx]])
IndexError: list index out of range

The parameters (-tl, -cl, -bl) don't change the output.

> Once you get to more senseful output

You mean how the text has been transcribed?

novacellus avatar Feb 18 '22 10:02 novacellus

I have tried to import your pagexml into eScriptorium. The imagefilename needs to get adapted to the imagename github gave your image but the system also gives another error that I haven't understood yet.

"Once you get to more senseful output": I mean once you get away from complete gibberish to readable output. If you train recognition on the segmentation of one system and then run recognition on the segmentation of another system, there is a gap between training data and inference data.

dstoekl avatar Feb 18 '22 10:02 dstoekl

> I have tried to import your pagexml into eScriptorium. The imagefilename needs to get adapted to the imagename github gave your image but the system also gives another error that I haven't understood yet.

I should have posted an external link. Sorry for that. Thank you for checking on it; let me know if I can be of any help. In the meantime, I was trying to understand why the repolygonize.py script is complaining about my XML.

    for line in lines:
        pol = line.find('./{*}Coords')
        if pol is not None:
            pol.attrib['points'] = ' '.join([','.join([str(x) for x in pt]) for pt in o[idx]])
            idx += 1

is not happy about my coords, but they seem pretty normal to me:

<Coords points="623,395 682,395 741,395 800,395 859,395 918,395 977,395 1036,395 1095,397 1154,397 1213,397 1272,397 1331,397 1390,397 1449,397 1508,397 1567,397 1626,398 1685,398 1744,398 1803,398 1803,364 1744,364 1685,364 1626,364 1567,363 1508,363 1449,363 1390,363 1331,363 1272,363 1213,363 1154,363 1095,363 1036,361 977,361 918,361 859,361 800,361 741,361 682,361 623,361"/>

novacellus avatar Feb 18 '22 10:02 novacellus

Do you know which line is problematic? If yes, please post its boundary and baseline.

dstoekl avatar Feb 18 '22 11:02 dstoekl

I'm afraid I don't. The error message doesn't point to the line in the input file. However, I'll try to debug it, starting with lib/xml.py, which feeds repolygonize.py.

novacellus avatar Feb 18 '22 11:02 novacellus

I've gone through repolygonize.py. There's a mismatch between the list of lines retrieved from the XML and the list of normalized polygons:

    line 45: lines = doc.findall('.//{*}TextLine')
    line 87: o = calculate_polygonal_environment(im, l, scale=(1800, 0), topline=topline)

The o list contains one element fewer than lines. ~~I'm not sure which line is missing, though.~~ The missing TextLine is:

<TextRegion id="region_1637667637822_18" custom="readingOrder {index:2;} structure {type:[drop-capital];}">
            <Coords points="648,1033 878,1033 878,1209 648,1209"/>
            <TextLine id="line_1637667653605_24" custom="readingOrder {index:0;}">
                <Coords points="680,1051 846,1051 846,1184 680,1184"/>
                <TextEquiv>
                    <Unicode>t</Unicode>
                </TextEquiv>
            </TextLine>
            <TextEquiv>
                <Unicode>t</Unicode>
            </TextEquiv>
</TextRegion>

This is an initial letter: the region overlaps with the larger one the letter is embedded in, hence the reduced number of coordinates.

I checked other files as well: the error occurs where there's no <Baseline> in the <TextLine> element.
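A minimal sketch of how such lines could be located before running the script. It assumes Python 3.8+ (for the `{*}` namespace wildcard in the standard library's ElementTree); the function name `lines_without_baseline` and the sample document, including its Baseline points, are hypothetical and not part of kraken:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down PAGE document: one regular line and one
# drop capital whose TextLine carries Coords but no Baseline.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <Coords points="623,395 1803,398 1803,364 623,361"/>
        <Baseline points="623,390 1803,393"/>
      </TextLine>
    </TextRegion>
    <TextRegion id="region_1637667637822_18">
      <TextLine id="line_1637667653605_24">
        <Coords points="680,1051 846,1051 846,1184 680,1184"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def lines_without_baseline(root):
    """Return the ids of all TextLine elements that have no Baseline child."""
    return [
        line.get("id")
        for line in root.iterfind(".//{*}TextLine")
        if line.find("./{*}Baseline") is None
    ]


root = ET.fromstring(SAMPLE)
print(lines_without_baseline(root))
```

For a file on disk, `ET.parse(path).getroot()` can be passed in place of the parsed sample string.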

novacellus avatar Feb 18 '22 13:02 novacellus

You could try changing the path in line 58 to .//{*}TextLine[{*}Baseline]; then it should only repolygonize lines with baselines.

(This is assuming you work with PAGE. Line 45 is for ALTO and would probably need to be .//{*}TextLine[@BASELINE].)
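To illustrate what the suggested PAGE path selects, here is a small sketch using the standard library's ElementTree (Python 3.8+). That the predicate behaves the same way in the lxml calls inside repolygonize.py is an assumption; the helper `select_lines` and the sample XML are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical PAGE fragment: line "b" has Coords but no Baseline.
PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="a"><Coords points="0,0 10,0 10,5 0,5"/><Baseline points="0,4 10,4"/></TextLine>
      <TextLine id="b"><Coords points="0,10 5,10 5,20 0,20"/></TextLine>
      <TextLine id="c"><Coords points="0,30 10,30 10,35 0,35"/><Baseline points="0,34 10,34"/></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def select_lines(root, path):
    """Return the ids of the elements matched by an ElementPath expression."""
    return [line.get("id") for line in root.findall(path)]


root = ET.fromstring(PAGE)

# Unfiltered path: every TextLine, including the one without a baseline.
print(select_lines(root, ".//{*}TextLine"))

# Path with a child predicate: only TextLines that contain a Baseline element.
print(select_lines(root, ".//{*}TextLine[{*}Baseline]"))
```

With the filtered path, the list of lines written back should have the same length as the list of polygons computed from the baselines, so the running index no longer overshoots.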

jjarosch avatar May 31 '22 00:05 jjarosch