Display document page metadata

Open mikegerber opened this issue 5 years ago • 4 comments

ALTO files contain metadata like this:

<OCRProcessing ID="IdOcr">
  <ocrProcessingStep>
    <processingDateTime>2014-05-21</processingDateTime>
    <processingSoftware>
      <softwareCreator>ABBYY</softwareCreator>
      <softwareName>ABBYY FineReader Engine</softwareName>
      <softwareVersion>11</softwareVersion>
    </processingSoftware>
  </ocrProcessingStep>
</OCRProcessing>

The report should display it.
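
As a rough illustration of what reading this block could look like, here is a minimal Python sketch using only the standard library. The helper names `local_name` and `alto_processing_steps` are made up for this example and are not part of dinglehopper. It assumes the element names shown above and matches them by local name, so it should work whether or not the file declares an ALTO namespace.

```python
import xml.etree.ElementTree as ET

# The ALTO fragment from this issue, used as sample input.
ALTO_SNIPPET = """\
<OCRProcessing ID="IdOcr">
  <ocrProcessingStep>
    <processingDateTime>2014-05-21</processingDateTime>
    <processingSoftware>
      <softwareCreator>ABBYY</softwareCreator>
      <softwareName>ABBYY FineReader Engine</softwareName>
      <softwareVersion>11</softwareVersion>
    </processingSoftware>
  </ocrProcessingStep>
</OCRProcessing>
"""

# Fields we want to show in the report (an assumption for this sketch).
WANTED = {"processingDateTime", "softwareCreator", "softwareName", "softwareVersion"}


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def alto_processing_steps(root):
    """Hypothetical helper: return one dict per <ocrProcessingStep>."""
    steps = []
    for elem in root.iter():
        if local_name(elem.tag) != "ocrProcessingStep":
            continue
        step = {}
        for child in elem.iter():
            name = local_name(child.tag)
            if name in WANTED and child.text:
                step[name] = child.text.strip()
        steps.append(step)
    return steps


steps = alto_processing_steps(ET.fromstring(ALTO_SNIPPET))
for step in steps:
    print(step)
```

A list of plain dicts keeps the extraction separate from however the report ends up rendering it.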

mikegerber avatar Jun 09 '20 14:06 mikegerber

This would be very useful!

Unfortunately, it will only work for ALTO, since there is no such provenance in PAGE-XML; one has to fall back on the METS container instead.

Also note that the <OCRProcessing> structure has been changed to <Processing> and heavily modified as of ALTO version 4.0.

cneud avatar Sep 25 '20 22:09 cneud

For PAGE files:

    <pc:Metadata>
        <pc:Creator>OCR-D/core 2.17.0</pc:Creator>
        <pc:Created>2020-10-02T09:13:28</pc:Created>
        <pc:LastChange>2020-10-02T09:13:28</pc:LastChange>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-olena-binarize">
            <pc:Labels>
                <pc:Label value="sauvola-ms-split" type="impl"/>
                <pc:Label value="0.34" type="k"/>
                <pc:Label value="0" type="win-size"/>
                <pc:Label value="0" type="dpi"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="layout/segmentation/region" value="ocrd-sbb-textline-detector">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="/var/lib/textline_detection" type="model"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-calamari-recognize">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="/var/lib/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/*.ckpt.json" type="checkpoint"/>
                <pc:Label value="glyph" type="textequiv_level"/>
                <pc:Label value="confidence_voter_default_ctc" type="voter"/>
                <pc:Label value="0.001" type="glyph_conf_cutoff"/>
            </pc:Labels>
        </pc:MetadataItem>
    </pc:Metadata>
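
A comparable sketch for the PAGE case, again in plain Python with the standard library. The helper `page_processing_steps` is hypothetical, and the sample is a shortened copy of the block above. The snippet in this issue does not show the `pc:` namespace declaration, so the URI in the sample is an assumption (the 2019-07-15 PAGE schema); elements are matched by local name, so other schema versions should work as well.

```python
import xml.etree.ElementTree as ET

# Shortened sample. Assumption: the namespace URI is the 2019-07-15 PAGE
# schema; the snippet in the issue omits the declaration.
PAGE_SNIPPET = """\
<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Metadata>
    <pc:Creator>OCR-D/core 2.17.0</pc:Creator>
    <pc:Created>2020-10-02T09:13:28</pc:Created>
    <pc:MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-olena-binarize">
      <pc:Labels>
        <pc:Label value="sauvola-ms-split" type="impl"/>
        <pc:Label value="0.34" type="k"/>
      </pc:Labels>
    </pc:MetadataItem>
    <pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-calamari-recognize">
      <pc:Labels externalModel="ocrd-tool" externalId="parameters">
        <pc:Label value="glyph" type="textequiv_level"/>
      </pc:Labels>
    </pc:MetadataItem>
  </pc:Metadata>
</pc:PcGts>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def page_processing_steps(root):
    """Hypothetical helper: one dict per MetadataItem of type 'processingStep'."""
    steps = []
    for item in root.iter():
        if local_name(item.tag) != "MetadataItem":
            continue
        if item.get("type") != "processingStep":
            continue
        # Each <pc:Label> carries one parameter as a type/value pair.
        params = {
            label.get("type"): label.get("value")
            for label in item.iter()
            if local_name(label.tag) == "Label"
        }
        steps.append({
            "name": item.get("name"),
            "processor": item.get("value"),
            "parameters": params,
        })
    return steps


page_steps = page_processing_steps(ET.fromstring(PAGE_SNIPPET))
for step in page_steps:
    print(step["name"], step["processor"], step["parameters"])
```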

mikegerber avatar Oct 02 '20 15:10 mikegerber

But note that only PAGE files produced by OCR-D include this information. I am not aware of any other tool with PAGE output that currently populates this section in this way.

cneud avatar Oct 02 '20 16:10 cneud

Yeah, if it's not there it will not be displayed.
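
To make that concrete, a small sketch of the "omit when absent" behaviour. The function `render_metadata_section` and the table layout are assumptions for illustration only, not how the dinglehopper report is actually built. It takes a list of dicts, one per processing step, and returns an empty string when there is nothing to show.

```python
from html import escape


def render_metadata_section(steps):
    """Hypothetical renderer: HTML table of metadata, or '' if there is none."""
    if not steps:
        # No provenance in the input file: emit nothing at all.
        return ""
    rows = []
    for step in steps:
        for key, value in step.items():
            # Escape both sides, since values come straight from the input file.
            rows.append(
                f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
            )
    return "<table>" + "".join(rows) + "</table>"


print(render_metadata_section([{"softwareName": "ABBYY FineReader Engine"}]))
print(repr(render_metadata_section([])))
```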

mikegerber avatar Oct 02 '20 16:10 mikegerber