dinglehopper
Add a parameter for selection of text level (PAGE XML)
Currently, dinglehopper extracts text from PAGE XML files on the region level (https://github.com/qurator-spk/dinglehopper/blob/master/qurator/dinglehopper/ocr_files.py#L50). It would be wonderful if you could add a level-of-operation parameter to allow for extraction from the line or word level. (Manual OCR correction is often done on a specific level, and propagation of text through the different levels is not widely implemented; i.e. I only know of the Aletheia Pro edition, which does it in both directions.)
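Such a level-of-operation parameter could look roughly like the following sketch. This is illustrative only, not dinglehopper's actual code: the function name and API are hypothetical, and it uses only the standard library.

```python
# Hypothetical sketch of a level-of-operation parameter for PAGE XML text
# extraction. Not dinglehopper's actual implementation.
import xml.etree.ElementTree as ET


def extract_texts(page_xml, level="region"):
    """Return the Unicode text of every element on the given level."""
    root = ET.fromstring(page_xml)
    # PAGE files carry a versioned default namespace; recover it from the root tag.
    ns = root.tag.split("}")[0].lstrip("{")
    tag = {"region": "TextRegion", "line": "TextLine", "word": "Word"}[level]
    texts = []
    for elem in root.iter(f"{{{ns}}}{tag}"):
        # Only the element's own (direct-child) TextEquiv/Unicode counts,
        # so a region's text is not confused with its lines' text.
        unicode_elem = elem.find(f"{{{ns}}}TextEquiv/{{{ns}}}Unicode")
        if unicode_elem is not None and unicode_elem.text:
            texts.append(unicode_elem.text)
    return texts
```

The same document then yields different text depending on the chosen level, e.g. `extract_texts(xml, "region")` vs. `extract_texts(xml, "line")`.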
I will definitely support line level because I'd like to display line images for OCR errors (a feature that would save me a lot of time), so this will come with that feature.
Do you have any "real" examples for word level extraction?
For our needs we added the option to select the level of extraction (region, line, word) and the index (which is necessary if you work with LAREX output). It is a quick-and-dirty implementation (and still in testing), but maybe you can work with it: https://github.com/JKamlah/dinglehopper
Unfortunately, I have some unmerged work on the text extraction due to #9, so I will need to reimplement your changes once mine are merged.
Could you provide some example data where `gtindex` and `ocrindex` matter, and explain the issue a bit? (I have trouble understanding the `getparent().attrib.get('index','-1') in ['-1', index]]` logic.)
> Do you have any "real" examples for word level extraction?
@wrznr I'd really like some real-world example for this :) I understand the feature request, but I'd really like to see an example where the levels are not consistent and to understand which software produced it.
@mikegerber Due to suboptimal GT transcription instructions, I have a bunch of GT files where the corrected OCR text is only available on the word level. I could of course write a simple script which fixes this, but if you could provide a general solution it would be great. Another use case: I perform OCR on ABBYY-segmented pages, keeping the ABBYY segmentation (which, as a byproduct, already has text in the regions). If the OCR-D OCR is stored on the word level (let's say for highlighting reasons), I have no chance to compare to GT, right?
@wrznr I understand the use case, and I will implement it. I've done some work on the extraction due to #9 that I need to finish, next step is implementing this.
But: it would still be useful to have examples of this kind of inconsistent files to get a better feel for the problem :-) (Especially, but not only, for the `index` issue @JKamlah is describing.)
> Another use case: I perform OCR on ABBYY-segmented pages, keeping the ABBYY segmentation (which as a byproduct already has text in the regions). If the OCR-D OCR is stored in the word level (let's say for highlighting reasons), I have no chance to compare to GT, right?
The OCR-D processors should update the regions' text; otherwise it's a bug IMHO. (This is why I ask for example files. I can, for example, only guess that maybe you have a buggy processor, or my understanding of correct behaviour is mistaken, or maybe @JKamlah's problem is that there are multiple OCR results and he needs to select one.)
@wrznr E.g. Aletheia (but only the Pro edition) also includes functionality with which it should be possible to fix this. Anyway, I second @mikegerber that it would be really helpful if you could share 1-2 such example files (email is also possible if you don't want to/can't link or upload here) so we can have a closer look.
> I have trouble understanding the `getparent().attrib.get('index','-1') in ['-1', index]]` logic
This is due to the work with LAREX. We did some corrections with LAREX on the line level to produce GT files. LAREX kept both the original text and the corrected text in the result file and separated them by index: the original text got index 1 and the corrected text index 0; uncorrected lines got no index at all. I don't know if that is a LAREX-specific procedure(?). Link to the LAREX example.
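The index-based selection described above could be sketched like this. The helper name and namespace handling are assumptions for illustration, not LAREX's or dinglehopper's code; the key point is that among several `TextEquiv` siblings, the one with the lowest `index` wins, and a missing `index` is treated as the main text.

```python
# Sketch of the index-selection logic: LAREX stores the corrected text as
# TextEquiv index="0" and the original as index="1"; uncorrected lines carry
# a single TextEquiv with no index at all. Hypothetical helper, not real code.
import xml.etree.ElementTree as ET


def best_unicode(element, ns):
    """Pick the Unicode text of the TextEquiv with the lowest index."""
    equivs = element.findall(f"{{{ns}}}TextEquiv")
    if not equivs:
        return None
    # A missing index attribute is treated as lowest, i.e. as the main text.
    best = min(equivs, key=lambda e: int(e.get("index", "-1")))
    unicode_elem = best.find(f"{{{ns}}}Unicode")
    return unicode_elem.text if unicode_elem is not None else None
```

Applied to a LAREX-style line with both versions, this returns the corrected text rather than the original.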
Here is another example, where the corrections were made only on the line level (with Aletheia). As @cneud wrote, we could fix this with Aletheia Pro or with a script as @wrznr suggested, but not every user has that background knowledge, so I would also prefer a general solution.
> If the OCR-D OCR is stored in the word level (let's say for highlighting reasons), I have no chance to compare to GT, right?
With the new implementation it should work.
> I have trouble understanding the `getparent().attrib.get('index','-1') in ['-1', index]]` logic
>
> This is due to the work with LAREX. We did some corrections with LAREX on the line level to produce GT files. LAREX kept both the original text and the corrected text in the result file and separated them by index: the original text got index 1 and the corrected text index 0; uncorrected lines got no index at all. I don't know if that is a LAREX-specific procedure(?). Link to the LAREX example.
Thanks for the example and the explanation! Now that part of the feature request makes sense to me and this is indeed according to the PAGE specs:
> Used for sort order in case multiple TextEquivs are defined. The text content with the lowest index should be interpreted as the main text content.
I'll open an extra issue for this.
> Here is another example, where the corrections were made only on the line level (with Aletheia). As @cneud wrote, we could fix this with Aletheia Pro or with a script as @wrznr suggested, but not every user has that background knowledge, so I would also prefer a general solution.
Do the users have the background knowledge to understand that they need to extract from line level?
The LAREX example by @JKamlah also shows the line vs. text region inconsistency; the `TextRegion`'s text is just empty:

```xml
<TextEquiv>
  <Unicode/>
</TextEquiv>
```
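This kind of inconsistency could be detected mechanically, e.g. by comparing a region's own text with the newline-joined text of its lines. The function name and the `"\n"` joining convention are assumptions for this sketch, not dinglehopper's actual behavior.

```python
# Sketch: detect the region-vs-line inconsistency by comparing a region's
# own TextEquiv with its lines' text joined by newlines. Hypothetical helper.
import xml.etree.ElementTree as ET


def region_is_consistent(region, ns):
    """True if the region's own text equals its lines' text joined by newlines."""
    def own_text(elem):
        # Direct-child TextEquiv/Unicode only; empty <Unicode/> counts as "".
        u = elem.find(f"{{{ns}}}TextEquiv/{{{ns}}}Unicode")
        return (u.text or "") if u is not None else ""

    region_text = own_text(region)
    line_text = "\n".join(own_text(line)
                          for line in region.findall(f"{{{ns}}}TextLine"))
    return region_text == line_text
```

For the LAREX example above, the empty region `TextEquiv` would make this return `False` for any region whose lines carry text.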
As of f14ae468700cb390bfb42151277bd330c74854b2, you can now choose to extract from the line level:

```
dinglehopper some-document.gt.page.xml some-document.ocr.page.xml --textequiv-level line
```

or, OCR-D'ed:

```
ocrd-dinglehopper -I OCR-D-GT-PAGE,OCR-D-OCR-CALAMARI -O OCR-D-OCR-CALAMARI-EVAL -P textequiv_level line --overwrite
```

There is now also a small convenience tool to extract text:

```
dinglehopper-extract some-document.gt.page.xml --textequiv-level line
```
@JKamlah Could you check if this extracts the text for you as you expect it?
Reopening, I forgot the word level.
@wrznr Could you send me an example file where this matters? Because 1. I suspect some problems with whitespace in that case, and 2. I have to consider using the word segmentation from the file instead of deriving it from the full text.
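The whitespace concern could be handled roughly like this: join words within one `TextLine` with spaces and lines with newlines, so that word-level text becomes comparable with line- or region-level GT. These joining rules and the function name are assumptions for the sketch, not settled dinglehopper behavior.

```python
# Sketch of word-level extraction with explicit whitespace handling:
# words within a TextLine are joined with spaces, lines with newlines.
# Hypothetical helper, not dinglehopper's implementation.
import xml.etree.ElementTree as ET


def words_to_text(page_root, ns):
    """Reassemble word-level Unicode text into line-structured full text."""
    lines = []
    for line in page_root.iter(f"{{{ns}}}TextLine"):
        words = []
        for word in line.findall(f"{{{ns}}}Word"):
            u = word.find(f"{{{ns}}}TextEquiv/{{{ns}}}Unicode")
            if u is not None and u.text:
                words.append(u.text)
        lines.append(" ".join(words))
    return "\n".join(lines)
```

Whether a space is always the right separator (think hyphenation or CJK scripts) is exactly the kind of question an example file would settle.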
> The LAREX example by @JKamlah also shows the line vs. text region inconsistency; the `TextRegion`'s text is just empty:
>
> ```xml
> <TextEquiv>
>   <Unicode/>
> </TextEquiv>
> ```
Regarding the empty `TextEquiv` on the region level in LAREX, I also opened an issue in the OCR4All project: OCR4all/OCR4all#91.
> Regarding the empty `TextEquiv` on the region level in LAREX, I also opened an issue in the OCR4All project: OCR4all/OCR4all#91.
Awesome, I couldn't make time yet to reproduce it myself, so I am thankful!
Note to self:
- [ ] Add the LAREX file to the tests
- [ ] Also review the LAREX output again, with respect to multiple `TextEquiv`s. I remember seeing it put versions of the strings there.