solr-ocrhighlighting converter from hOCR to miniOCR

Do you have an internal tool for converting standard OCR formats to your MiniOCR format?

Such a tool would be useful, since you would presumably keep it in sync with the current MiniOCR specifics.

For example, there is a limitation when producing MiniOCR: we cannot add attributes to the page <p> tag to carry over information from hOCR (the title attribute in our case). Indexing simply fails when title is present. Is there a reason for such validation (performance or similar)?

hOCR example

  <div class='ocr_page' id='1' title='image "alice1.png"; bbox 0 0 801 599; ppageno 0; scan_res 70 70'>

we tried

<p xml:id="page_identifier" title='alice1.png'>
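For anyone writing their own converter in the meantime, here is a minimal Python sketch of how the hOCR title above could be mapped to a MiniOCR page tag. It assumes the page tag only takes an identifier and a `wh` (width and height) attribute, which is my reading of the MiniOCR docs, so please check against your plugin version. The function names `parse_hocr_title` and `miniocr_page_tag` are made up for this example.

```python
from xml.sax.saxutils import quoteattr


def parse_hocr_title(title):
    """Split an hOCR title attribute into {property name: raw value string}.

    Limitation of this sketch: it splits on ';', so a quoted file name
    that itself contains ';' would be split incorrectly.
    """
    props = {}
    for part in title.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition(" ")
        props[name] = value.strip()
    return props


def miniocr_page_tag(page_id, title):
    """Build an opening <p> tag from the hOCR page bbox (hypothetical helper).

    Assumes MiniOCR pages carry only xml:id and wh="<width> <height>".
    """
    props = parse_hocr_title(title)
    x0, y0, x1, y1 = (int(v) for v in props["bbox"].split())
    return f'<p xml:id={quoteattr(page_id)} wh="{x1 - x0} {y1 - y0}">'


title = 'image "alice1.png"; bbox 0 0 801 599; ppageno 0; scan_res 70 70'
print(miniocr_page_tag("page_identifier", title))
```

The other properties (image, ppageno, scan_res) end up in the dict as well, so they can be stored elsewhere if the page tag cannot hold them.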

Jul 26 '24 10:07 hrvoj3e

We do have one, but it's coupled with a bunch of internal stuff, so not a candidate for open sourcing at the moment.

Is there a reason for such validation (performance or similar)?

Yeah, laziness 🙈 Or rather, MiniOCR was conceived as an extremely minimal format containing nothing that is not directly needed by the plugin; for page tags, that means an identifier and the dimensions.

MiniOCR is really only intended for consumption by the plugin, and since a title is not needed, it's not supported by the parser.

Are you using MiniOCR outside of the Solr index as well? Or what do you need the title for?

Sep 13 '24 11:09 jbaiter

Hi. We wrote our own converter, but found that there are exceptions that need to be handled, so a tool that works with many different hOCR files would be great. We have hOCR files that are 10+ years old as well as new ones, so there are some differences.

Having a separate tool in a repo with issues and PRs would be great.

We are considering saving MiniOCR to a file and storing it alongside the record for reindexing purposes. Having a title (or some other attribute) would be great for saving app-specific data about a page. I think you should just ignore it and not break parsing. :)

Hm... It just occurred to me that I could insert a comment in the MiniOCR XML right after <p> to add some specific info. This will not break the parser, right?
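To make the comment idea concrete, here is a rough Python sketch of what I mean, covering only the application side (reading the comment back from a stored MiniOCR file). The sample document, the `source-image:` convention and the `page_comments` function are all invented for illustration, and the `wh`/`x` attributes are my understanding of the format rather than something verified here. It needs Python 3.8+ for `insert_comments`.

```python
import xml.etree.ElementTree as ET

# xml:id is expanded to this qualified name by ElementTree.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample: a comment directly after <p> carries app-specific data.
MINIOCR = (
    '<ocr><p xml:id="page_identifier" wh="801 599">'
    "<!-- source-image: alice1.png -->"
    '<b><l><w x="10 20 30 40">Alice</w></l></b>'
    "</p></ocr>"
)


def page_comments(xml_text):
    """Return {page id: [comment texts]} for comments directly inside <p>."""
    # By default ElementTree drops comments; this keeps them in the tree.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)
    result = {}
    for page in root.iter("p"):
        result[page.get(XML_ID)] = [
            child.text.strip() for child in page if child.tag is ET.Comment
        ]
    return result


print(page_comments(MINIOCR))
```

One thing to keep in mind: XML comments cannot contain `--`, so any value stored this way would need escaping or encoding.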

ATM, we are using the approach of indexing MiniOCR stored inside Solr (not on disk).

Sep 13 '24 12:09 hrvoj3e

Thanks for the quick answer!

I think you should just ignore it and not break parsing. :)

Will do :-) Should be in the next release.

Hm... It just occurred to me that I could insert a comment in miniOCR XML right after <p> to add some specific info. This will not break the parser - right?

Right, it shouldn't break the parser at all. For parsing the full document we use a proper XML parser that respects comments and doesn't care about extra attributes. But the page element is special, since we currently use a regex-based parser to quickly find and parse page headers in a match's context.
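To illustrate the general difference (this is not the plugin's actual code, and both patterns below are hypothetical), here is a small Python sketch of how a regex that expects a fixed attribute layout rejects a page header with an extra attribute, while a more tolerant pattern simply collects whatever attributes are there:

```python
import re

# Hypothetical strict pattern: only xml:id and an optional wh, in that order.
STRICT = re.compile(r'<p xml:id="(?P<id>[^"]+)"(?: wh="(?P<w>\d+) (?P<h>\d+)")?>')

# Hypothetical tolerant patterns: find the tag, then read any attributes.
# Limitation: assumes no '>' inside attribute values.
PAGE_TAG = re.compile(r"<p\b(?P<attrs>[^>]*)>")
ATTR = re.compile(r'''([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')''')


def parse_page_header(text):
    """Return the attributes of the first <p ...> tag as a dict, or None."""
    match = PAGE_TAG.search(text)
    if match is None:
        return None
    return {
        name: double or single
        for name, double, single in ATTR.findall(match.group("attrs"))
    }


header = '<p xml:id="page_identifier" title=\'alice1.png\'>'
print(STRICT.search(header))      # the strict pattern finds nothing
print(parse_page_header(header))  # the tolerant one keeps both attributes
```

With the tolerant variant, unknown attributes such as title can be read and then ignored instead of causing a failure.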

Sep 13 '24 12:09 jbaiter

I just merged a converter tool from hOCR and ALTO to MiniOCR, give it a try!

Oct 01 '24 14:10 jbaiter