ocr-fileformat "ocr-transform page alto ... ...": losing text

Example page generated with OCR-D ocrd-calamari-recognize: OCR_0007.zip

Running ocr-transform page hocr ... ... && ocr-transform hocr alto2.0 ... ... instead loses the page size.

Feb 28 '20 11:02 jbarth-ubhd

strace -f shows no open() syscall on any /usr/local/share/ocr-fileformat/xslt/* file.

But it does call execve("/usr/bin/java", ["java", "-jar", "/usr/local/share/ocr-fileformat/vendor/JPageConverter/PageConverter.jar", "-neg-coords", "toZero", "-source-xml", "OCR_0007.xml", "-target-xml", "xxx", "-convert-to", "ALTO"], 0x5614283d4a10 /* 24 vars */) = 0

Feb 28 '20 12:02 jbarth-ubhd

I've checked the docs of the most recent JPageConverter. The versions listed as available for -convert-to are:

LATEST
2013-07-15
2010-03-19

but not ALTO?

Feb 28 '20 12:02 jbarth-ubhd

Perhaps duplicate of https://github.com/PRImA-Research-Lab/prima-page-converter/issues/13

Feb 28 '20 13:02 jbarth-ubhd

Perhaps duplicate of PRImA-Research-Lab/prima-page-converter#13

Indeed, PAGE-ALTO conversion requires word segmentation. @maxnth Can you think of any sensible workaround?

Feb 28 '20 13:02 kba

Did a quick-and-dirty script: https://gist.github.com/jbarth-ubhd/0e867c20008639145386a7978fdb27a4

Feb 28 '20 14:02 jbarth-ubhd

Great, but maybe we can integrate on-the-fly pseudo-word creation directly into the converter, behind a command-line flag.

Feb 28 '20 14:02 kba

Word-level PAGE XML output for calamari has been planned for some time now, but sadly we haven't gotten around to actually implementing it yet. It's one of my next tasks, though, and will hopefully be included in calamari within the upcoming month. I don't know whether that's too late for this specific case, but maybe knowing that the feature is being worked on helps anyway.

Feb 28 '20 18:02 maxnth

seems not to be fixed in v0.4.0.

Dec 21 '20 11:12 jbarth-ubhd

seems not to be fixed in v0.4.0.

ocrd_calamari is at 1.0.0 and calamari at 1.0.5, but word-level PAGE output is indeed not implemented in calamari yet, AFAICT.

Dec 21 '20 11:12 kba

ocrd_calamari (but AFAIK not Calamari yet) has been able to produce word- and glyph-level segmentation for a year now; it just does not do so by default. Sorry I didn't speak up earlier, I just didn't know about this issue here.

@jbarth-ubhd You need to set ocrd_calamari's parameter -P textequiv_level word.

Quoting ocrd_calamari's README:

In addition to the line text it may also output word and glyph segmentation including per-glyph confidence values and per-glyph alternative predictions as provided by the Calamari OCR engine, using a textequiv_level of word or glyph. Note that while Calamari does not provide word segmentation, this processor produces word segmentation inferred from text segmentation and the glyph positions. The provided glyph and word segmentation can be used for text extraction and highlighting, but is probably not useful for further image-based processing.

ocrd_calamari does more than Calamari here because we wanted to include Calamari's glyph-level information, i.e. character positions and alternative (less probable) character predictions. As PAGE XML has a strict line > word > glyph hierarchy, we also needed to include a word segmentation. This word segmentation is inferred from the text, e.g. "Lorem ipsum dolor sit amet" becomes "Lorem| |ipsum| |dolor| |sit| |amet", splitting strictly on spaces as expected by OCR-D's validation.
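
To illustrate the idea only (this is not ocrd_calamari's actual code), here is a minimal Python sketch of inferring words from glyph positions by splitting strictly on spaces. The Glyph class, the group_glyphs_into_words function and the fixed 10-pixel glyph width are all hypothetical and chosen just for the example; real glyph boxes would come from the OCR engine.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Glyph:
    """One recognized character with its horizontal extent (hypothetical type)."""
    char: str
    x0: int
    x1: int


def group_glyphs_into_words(glyphs: List[Glyph]) -> List[Tuple[str, int, int]]:
    """Split a line's glyphs into words strictly on space characters.

    Returns (text, x0, x1) per word, where x0/x1 span from the first
    to the last glyph of that word.
    """
    groups: List[List[Glyph]] = []
    current: List[Glyph] = []
    for glyph in glyphs:
        if glyph.char == " ":
            # A space closes the current word (if there is one).
            if current:
                groups.append(current)
                current = []
        else:
            current.append(glyph)
    if current:
        groups.append(current)
    return [("".join(g.char for g in grp), grp[0].x0, grp[-1].x1) for grp in groups]


# Usage: assume every glyph is 10 pixels wide (an assumption for this demo).
line = "Lorem ipsum"
glyphs = [Glyph(c, i * 10, (i + 1) * 10) for i, c in enumerate(line)]
words = group_glyphs_into_words(glyphs)
print(words)
```

Word boxes derived this way only cover the glyph extents, which matches the README's caveat that they are fine for text extraction and highlighting but probably not for further image-based processing.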

Feb 05 '21 02:02 mikegerber

Indeed, PAGE-ALTO conversion requires word segmentation.

I wasn't aware of that until now, good to know! And good it's already in ocrd_calamari, albeit originally for an entirely different reason. 😀

Feb 05 '21 02:02 mikegerber

What prima-page-converter/ocr-fileformat could do, as far as I can tell from this issue: Give a user-friendly warning that there are no words in the PAGE document, so that ALTO conversion is not possible.
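
A rough sketch of what such a pre-flight check could look like, in Python with only the standard library. It is not part of prima-page-converter or ocr-fileformat; the function names count_page_words and warn_if_no_words and the warning text are made up for illustration. It matches the Word element by local name, so it does not depend on which PAGE schema version the file declares.

```python
import sys
import xml.etree.ElementTree as ET


def count_page_words(page_xml: str) -> int:
    """Count Word elements in a PAGE XML string, ignoring the namespace."""
    root = ET.fromstring(page_xml)
    count = 0
    for element in root.iter():
        # Tags look like "{namespace}Word"; keep only the local name.
        if isinstance(element.tag, str) and element.tag.rsplit("}", 1)[-1] == "Word":
            count += 1
    return count


def warn_if_no_words(page_xml: str) -> bool:
    """Print a warning to stderr and return True if the document has no words."""
    if count_page_words(page_xml) == 0:
        print(
            "warning: PAGE document contains no Word elements; "
            "ALTO output would lose the text",
            file=sys.stderr,
        )
        return True
    return False


# Usage with a tiny line-level-only document (content invented for the demo).
NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
line_only = (
    f'<PcGts xmlns="{NS}"><Page><TextRegion><TextLine>'
    "<TextEquiv><Unicode>Lorem ipsum</Unicode></TextEquiv>"
    "</TextLine></TextRegion></Page></PcGts>"
)
warn_if_no_words(line_only)
```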

Feb 05 '21 12:02 mikegerber

None of this is needed anymore: since https://github.com/UB-Mannheim/ocr-fileformat/pull/134 we have been using https://github.com/kba/page-to-alto for this purpose instead.

I suggest closing (cannot do it myself).

Jun 06 '23 14:06 bertsky
