ocrd_tesserocr

Move PAGE helper functions from recognize to the generateDS API?

Open • kba opened this issue 5 years ago • 1 comment

Now that the generateDS API has been refactored to make it easier to extend, IMHO it would be useful to have these functions available for all processors:

  • page_element_unicode0
  • page_element_conf0
  • page_get_reading_order
  • page_update_higher_textequiv_levels

kba, Jun 04 '20 12:06

Agreed!

  • page_element_unicode0
  • page_element_conf0

Maybe these could go as member functions get_Unicode0 and get_conf0 into GlyphType, WordType, TextLineType and TextRegionType.
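A minimal sketch of what such accessors might do, written as free functions against stand-in classes so that it runs without OCR-D installed. `FakeTextEquiv` and `FakeElement` are hypothetical and only mimic the `get_TextEquiv()` getter; the names `get_unicode0` and `get_conf0`, and the fallback confidence of 1.0 when no value is present, are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeTextEquiv:
    """Hypothetical stand-in for a generated TextEquiv object."""
    Unicode: Optional[str] = None
    conf: Optional[float] = None


@dataclass
class FakeElement:
    """Hypothetical stand-in for a glyph, word, line or region."""
    text_equivs: List[FakeTextEquiv] = field(default_factory=list)

    def get_TextEquiv(self):
        return self.text_equivs


def get_unicode0(element) -> str:
    """Return the text of the first TextEquiv, or '' if there is none."""
    equivs = element.get_TextEquiv()
    if not equivs:
        return ''
    return equivs[0].Unicode or ''


def get_conf0(element, default: float = 1.0) -> float:
    """Return the confidence of the first TextEquiv as a float.

    Falls back to `default` (an assumed value) when there is no
    TextEquiv or no confidence on it.
    """
    equivs = element.get_TextEquiv()
    if not equivs or equivs[0].conf is None:
        return default
    return float(equivs[0].conf)
```

As member functions, the same bodies would take `self` instead of `element`, so every processor could call them on any text-bearing element without re-implementing the empty-list check.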

  • page_get_reading_order

I use this a lot, but it could be better. Once it lives in ocrd_page_generateds, the function should

  • be named get_reading_order_dict or similar (as a member of PageType)
  • instantiate the first/top-level dict itself
  • call the top-level get_ReadingOrder() and its get_OrderedGroup() or get_UnorderedGroup() itself (all robust to empty results)

  • page_update_higher_textequiv_levels

Maybe we could trigger this automatically whenever a TextEquiv gets added anywhere and/or before serialization. (In a similar spirit to planned automatic coordinate sanitation.)

Anyway, the version here is the most complete so far, but it could be simplified with the new API in core.
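To illustrate the kind of propagation meant here, a rough sketch on a hypothetical `Node` tree rather than on the real PAGE classes. The joining rules (newline between lines, space between words, nothing between glyphs) and the averaged confidence are assumptions made for this example and are not taken from the thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a PAGE text element."""
    level: str  # 'region', 'line', 'word' or 'glyph'
    text: Optional[str] = None
    conf: Optional[float] = None
    children: List['Node'] = field(default_factory=list)


# Separator placed between the children of each level (assumed rules).
SEPARATORS = {'region': '\n', 'line': ' ', 'word': ''}


def update_higher_levels(node: Node, overwrite: bool = True) -> None:
    """Rebuild text and confidence of `node` bottom-up from its children."""
    if not node.children:
        return  # leaves keep whatever the recognizer wrote
    for child in node.children:
        update_higher_levels(child, overwrite)
    if node.text is not None and not overwrite:
        return  # keep an existing result on this level
    node.text = SEPARATORS[node.level].join(
        child.text or '' for child in node.children)
    confs = [child.conf for child in node.children if child.conf is not None]
    node.conf = sum(confs) / len(confs) if confs else None
```

Calling this once before serialization would keep line and region text in sync with the words and glyphs underneath, which is the effect the automatic trigger suggested above would have.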

I should also mention: page_add_to_reading_order
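For the reading-order helper discussed in this comment, the proposed get_reading_order_dict could look roughly like the sketch below. All `Fake*` classes are hypothetical stand-ins that only borrow the getter names `get_ReadingOrder()`, `get_OrderedGroup()` and `get_UnorderedGroup()` from the thread; `get_regionRef()` and the single flattened `members` list are simplifying assumptions, since the generated classes keep separate child lists per element type.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class FakeRef:
    """Hypothetical stand-in for a region reference."""
    region_ref: str

    def get_regionRef(self):
        return self.region_ref


@dataclass
class FakeGroup:
    """Hypothetical stand-in for an ordered or unordered group."""
    region_ref: Optional[str] = None
    members: List[Union[FakeRef, 'FakeGroup']] = field(default_factory=list)

    def get_regionRef(self):
        return self.region_ref


@dataclass
class FakeReadingOrder:
    ordered: Optional[FakeGroup] = None
    unordered: Optional[FakeGroup] = None

    def get_OrderedGroup(self):
        return self.ordered

    def get_UnorderedGroup(self):
        return self.unordered


@dataclass
class FakePage:
    reading_order: Optional[FakeReadingOrder] = None

    def get_ReadingOrder(self):
        return self.reading_order


def get_reading_order_dict(page) -> Dict[str, object]:
    """Map region IDs to reading-order elements; empty dict if none exist."""
    result: Dict[str, object] = {}
    reading_order = page.get_ReadingOrder()
    if reading_order is None:
        return result
    top = reading_order.get_OrderedGroup() or reading_order.get_UnorderedGroup()
    if top is not None:
        _collect(result, top)
    return result


def _collect(result, group) -> None:
    """Add all members of `group` to `result`, descending into subgroups."""
    for member in group.members:
        if member.get_regionRef():
            result[member.get_regionRef()] = member
        if isinstance(member, FakeGroup):
            _collect(result, member)
```

The caller no longer has to create the dict or dig out the top-level group, and a page without any reading order simply yields an empty dict.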

bertsky, Jun 04 '20 15:06