
Add a parameter for selection of text level (PAGE XML)

Open wrznr opened this issue 5 years ago • 17 comments

Currently, dinglehopper extracts text from PAGE XML files on the region level (https://github.com/qurator-spk/dinglehopper/blob/master/qurator/dinglehopper/ocr_files.py#L50). It would be wonderful if you could add a level-of-operation parameter to allow for extraction from line or word level. (Manual OCR correction is often done on one specific level, and propagation of text through the different levels is not widely implemented; I only know of the Aletheia Pro edition, which does it in both directions.)

wrznr avatar Nov 06 '19 13:11 wrznr

I will definitely support line level because I'd like to display line images for OCR errors (a feature that would save me a lot of time), so this will come with that feature.

Do you have any "real" examples for word level extraction?

mikegerber avatar Nov 06 '19 16:11 mikegerber

For our needs we added the option to select the level of extraction (region, line, word) and the index (which is necessary if you work with LAREX output). It is a quick-and-dirty implementation (and still in testing), but maybe you can work with it: https://github.com/JKamlah/dinglehopper

JKamlah avatar Sep 17 '20 10:09 JKamlah

Unfortunately, I have some unmerged work on the text extraction due to #9, so I will need to reimplement your changes once my changes are merged.

Could you provide some example data where gtindex and ocrindex matters and explain the issue a bit? (I have trouble understanding the getparent().attrib.get('index','-1') in ['-1', index]]) logic.)

mikegerber avatar Oct 01 '20 11:10 mikegerber

Do you have any "real" examples for word level extraction?

@wrznr I'd really like some real world example for this :) I understand the feature request but I'd really like to see an example where the levels are not consistent and understand which software produced it.

mikegerber avatar Oct 01 '20 11:10 mikegerber

@mikegerber Due to suboptimal GT transcription instructions I have a bunch of GT files where the corrected OCR text is only available on the word level. I could of course write a simple script which fixes this but if you could provide a general solution it would be great. Another use case: I perform OCR on ABBYY-segmented pages, keeping the ABBYY segmentation (which as a byproduct already has text in the regions). If the OCR-D OCR is stored in the word level (let's say for highlighting reasons), I have no chance to compare to GT, right?

wrznr avatar Oct 02 '20 15:10 wrznr

@wrznr I understand the use case, and I will implement it. I've done some work on the extraction due to #9 that I need to finish; the next step is implementing this.

But: it would still be useful to have examples of this kind of inconsistent file to get a better feel for the problem :-) (Especially, but not only, for the index issue @JKamlah is describing.)

Another use case: I perform OCR on ABBYY-segmented pages, keeping the ABBYY segmentation (which as a byproduct already has text in the regions). If the OCR-D OCR is stored in the word level (let's say for highlighting reasons), I have no chance to compare to GT, right?

The OCR-D processors should update the regions text, otherwise it's a bug IMHO. (This is why I ask for example files. I can, for example, only guess that maybe you have a buggy processor or my understanding of correct behaviour is mistaken and maybe @JKamlah's problem is that there are multiple OCR results and he needs to select one.)

mikegerber avatar Oct 02 '20 16:10 mikegerber

@wrznr E.g. Aletheia (but only the Pro edition) also has functionality included with which it should be possible to fix this. Anyway I second @mikegerber that it would be really helpful if you can share 1-2 such example files (email also possible if you don't want to/can't link or upload here) so we can have a closer look.

cneud avatar Oct 02 '20 17:10 cneud

I have trouble understanding the getparent().attrib.get('index','-1') in ['-1', index]]) logic

This is due to the work with LAREX. We did some corrections with LAREX on line level to produce GT files. LAREX kept both the original text and the corrected text in the result file and separated them by index. The original text got index 1 and the corrected text index 0; uncorrected lines got no index at all. I don't know if that is a LAREX-specific procedure(?). Link to the LAREX example.

Here is another example, where the corrections were made only on the line level (with Aletheia). As @cneud wrote, we could fix this with Aletheia Pro or with a script as @wrznr suggested, but not every user has that background knowledge, so I would also prefer a general solution.

If the OCR-D OCR is stored in the word level (let's say for highlighting reasons), I have no chance to compare to GT, right?

With the new implementation it should work.

JKamlah avatar Oct 16 '20 11:10 JKamlah

I have trouble understanding the getparent().attrib.get('index','-1') in ['-1', index]]) logic

This is due to the work with LAREX. We did some corrections with LAREX on line level to produce GT files. LAREX kept both the original text and the corrected text in the result file and separated them by index. The original text got index 1 and the corrected text index 0; uncorrected lines got no index at all. I don't know if that is a LAREX-specific procedure(?). Link to the LAREX example.

Thanks for the example and the explanation! Now that part of the feature request makes sense to me and this is indeed according to the PAGE specs:

Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content.

I'll open an extra issue for this.
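To make the index rule quoted above concrete, here is a minimal sketch of selecting the main text when an element carries several TextEquivs. It is not dinglehopper's implementation: `main_text` and the sample line are hypothetical, the 2019-07-15 PAGE namespace is assumed, and treating a missing index as 0 is an assumption of this sketch rather than something the PAGE spec states.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (PAGE schema 2019-07-15); real files may use another version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up line in the style described for LAREX: original text at index 1,
# corrected text at index 0.
SAMPLE_LINE = """\
<TextLine xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" id="l1">
  <TextEquiv index="1"><Unicode>Tbe quick brown fox</Unicode></TextEquiv>
  <TextEquiv index="0"><Unicode>The quick brown fox</Unicode></TextEquiv>
</TextLine>
"""


def main_text(element):
    """Return the text of the direct-child TextEquiv with the lowest index.

    Hypothetical helper. A TextEquiv without an index attribute is treated
    as index 0 here, which is an assumption of this sketch.
    """
    textequivs = element.findall("pc:TextEquiv", NS)
    if not textequivs:
        return None
    best = min(textequivs, key=lambda te: int(te.get("index", "0")))
    return best.findtext("pc:Unicode", default="", namespaces=NS)


line = ET.fromstring(SAMPLE_LINE)
print(main_text(line))
```

With the sample above, the corrected text at index 0 is selected even though it appears second in the file.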

Here is another example, where the corrections were made only on the line level (with Aletheia). As @cneud wrote, we could fix this with Aletheia Pro or with a script as @wrznr suggested, but not every user has that background knowledge, so I would also prefer a general solution.

Do the users have the background knowledge to understand that they need to extract from line level?

mikegerber avatar Oct 16 '20 11:10 mikegerber

The LAREX example by @JKamlah also shows the line vs text region inconsistency, the TextRegion's text is just empty:

<TextEquiv>
<Unicode/>
</TextEquiv>

mikegerber avatar Oct 16 '20 12:10 mikegerber

As of f14ae468700cb390bfb42151277bd330c74854b2, you can now choose to extract from line level:

dinglehopper some-document.gt.page.xml some-document.ocr.page.xml --textequiv-level line

or OCR-D'ed:

ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-CALAMARI -O OCR-D-OCR-CALAMARI-EVAL -P textequiv_level line --overwrite

mikegerber avatar Oct 21 '20 16:10 mikegerber

There is now also a small convenience tool for extracting text:

dinglehopper-extract some-document.gt.page.xml --textequiv-level line
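For readers unfamiliar with PAGE XML, the following is a rough sketch of what extracting at a given level amounts to. It is not the code behind `--textequiv-level`: `extract_texts`, `LEVEL_TAGS`, and the sample page are hypothetical, the 2019-07-15 namespace is assumed, and reading order and multiple TextEquivs are ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (PAGE schema 2019-07-15); real files may use another version.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

# Hypothetical mapping from level name to PAGE element name.
LEVEL_TAGS = {"region": "TextRegion", "line": "TextLine", "word": "Word"}

# Made-up page: text was corrected on line level only, the region text is empty.
SAMPLE_PAGE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv><Unicode>corrected line one</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <TextEquiv><Unicode>corrected line two</Unicode></TextEquiv>
      </TextLine>
      <TextEquiv><Unicode/></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def extract_texts(root, level):
    """Collect the text of every element at the given level, in document order."""
    tag = LEVEL_TAGS[level]
    texts = []
    for element in root.iter("{%s}%s" % (PAGE_NS, tag)):
        # Look only at the element's own (direct-child) TextEquiv,
        # not at the TextEquivs of nested lines or words.
        texts.append(
            element.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        )
    return texts


root = ET.fromstring(SAMPLE_PAGE)
for level in ("region", "line", "word"):
    print(level, extract_texts(root, level))
```

On this sample, the region level yields only an empty string while the line level yields the corrected text, which is the kind of inconsistency discussed earlier in the thread.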

mikegerber avatar Oct 21 '20 16:10 mikegerber

@JKamlah Could you check if this extracts the text for you as you expect it?

mikegerber avatar Oct 21 '20 16:10 mikegerber

Reopening, I forgot the word level.

@wrznr Could you send me an example file where this matters? Because 1. I suspect some problems with whitespace in that case and 2. I have to consider using the word segmentation from the file instead of doing it from the full text.
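A small sketch of where the whitespace concern could show up, assuming line text is rebuilt by joining Word texts with a single space. The helper `line_text_from_words` and the sample line are hypothetical, and the join rule is an assumption of this sketch, not a decision made in this issue.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (PAGE schema 2019-07-15); real files may use another version.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Made-up line whose own TextEquiv contains a double space between the words.
SAMPLE_LINE = """\
<TextLine xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" id="l1">
  <Word id="w1"><TextEquiv><Unicode>Hello,</Unicode></TextEquiv></Word>
  <Word id="w2"><TextEquiv><Unicode>world</Unicode></TextEquiv></Word>
  <TextEquiv><Unicode>Hello,  world</Unicode></TextEquiv>
</TextLine>
"""


def line_text_from_words(line):
    """Join the texts of the line's Word children with single spaces (assumed rule)."""
    words = [
        word.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        for word in line.findall("pc:Word", NS)
    ]
    return " ".join(w for w in words if w)


line = ET.fromstring(SAMPLE_LINE)
from_words = line_text_from_words(line)
from_line = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
print(repr(from_words))
print(repr(from_line))
print(from_words == from_line)
```

Here the two strings differ only in whitespace, so a character-level comparison would count an error depending on which level the text was taken from.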

mikegerber avatar Oct 22 '20 12:10 mikegerber

The LAREX example by @JKamlah also shows the line vs text region inconsistency, the TextRegion's text is just empty:

<TextEquiv>
<Unicode/>
</TextEquiv>

Regarding the empty TextEquiv on region level in LAREX I also opened an issue in the OCR4All project: OCR4all/OCR4all#91.

b2m avatar Oct 27 '20 11:10 b2m

Regarding the empty TextEquiv on region level in LAREX I also opened an issue in the OCR4All project: OCR4all/OCR4all#91.

Awesome, I couldn't make time yet to reproduce it myself, so I am thankful!

mikegerber avatar Oct 29 '20 15:10 mikegerber

Note to self:

  • [ ] Add LAREX file to tests
  • [ ] Also review LAREX output again, with respect to multiple TextEquivs. I remember seeing it putting versions of the strings there

mikegerber avatar Mar 02 '23 08:03 mikegerber