
ALTO - PAGE xml: Object mapping and possible transformation generation

Open Jo-CCS opened this issue 7 years ago • 7 comments

At the face-to-face conference in Vienna, the idea came up to define a best-practice mapping between the PAGE and ALTO standard objects and to generate a conversion from it. If feasible, the transformation could be provided as XSLT.

The idea is to base the mapping on the latest ALTO version (4) and the upcoming PAGE version due in June, and from there to work backwards as far as this makes sense.

The goal is a common mapping solution, especially for objects where no exact match is possible and workarounds or compromises need to be defined.

Jo-CCS avatar May 02 '18 07:05 Jo-CCS

Document with list of features here: Doc

chris1010010 avatar May 02 '18 15:05 chris1010010

I made a start here: prima-core-libs (Java), in XmlPageWriter_Alto.java. It can already convert the main things such as blocks, text lines, strings and glyphs with shapes, but there are many ToDos.

Some issues that need discussing:

  • Margins (LeftMargin, TopMargin etc.). How much are those used in practice? We could approximate by using bounding boxes.
  • SP element. How is that used typically?
  • HYP element. Difficult to do. We could look for hyphens in the text content. But is every hyphen at the end of a text line a HYP?
  • Text CONTENT. At the moment I assume top-to-bottom text line order and left-to-right word/glyph order to determine string/glyph content (needed if the text is stored at region or text line level in the PAGE file).
  • GraphicalElement type. According to the schema documentation this is for separating lines and rectangles, so for now I only map PAGE Separator to this. Any other non-text region is mapped to Illustration. Regions that have child regions are mapped to ComposedBlock.
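
To make the block-type rule in the last bullet concrete, here is a minimal Python sketch. It is not taken from XmlPageWriter_Alto.java: the `Region` class and the `alto_block_type` function are hypothetical, and two details are assumptions on my part, namely that the ComposedBlock rule takes precedence over the region type and that the PAGE separator is identified by the element name `SeparatorRegion`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """Hypothetical stand-in for a PAGE region (not the prima-core-libs API)."""
    kind: str  # PAGE element name, e.g. "TextRegion" or "SeparatorRegion"
    children: List["Region"] = field(default_factory=list)


def alto_block_type(region: Region) -> str:
    """Return the ALTO block element name for a PAGE region."""
    # Assumption: a region with child regions becomes a ComposedBlock,
    # regardless of its own type.
    if region.children:
        return "ComposedBlock"
    if region.kind == "TextRegion":
        return "TextBlock"
    # Only separators map to GraphicalElement (separating lines, rectangles).
    if region.kind == "SeparatorRegion":
        return "GraphicalElement"
    # Every other non-text region falls back to Illustration.
    return "Illustration"
```

With this rule, an image, chart or table region without children would all end up as Illustration, which is one of the places where the mapping is lossy and may need a documented compromise.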

The idea is to extend the JPageConverter to accept ALTO as target format. Already added but not tested: https://github.com/PRImA-Research-Lab/prima-page-converter

chris1010010 avatar Nov 06 '19 18:11 chris1010010

@chris1010010 This is great for a head start, many thanks! I will also circulate this within the @OCR-D community for comments and contributions.

cneud avatar Nov 06 '19 18:11 cneud

@cneud Happy to discuss priorities and sharing of work to keep the momentum. Thorough testing is a big chunk of work that can be easily distributed.

chris1010010 avatar Nov 07 '19 09:11 chris1010010

I made some progress in the Java converter. Open issues: SP, HYP, margins
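
For discussion, two of the open issues can be sketched in a few lines of Python. This is only an illustration of the heuristics mentioned earlier in the thread (margins approximated from bounding boxes, hyphens looked up at line ends), not code from the Java converter. The function names, the `(left, top, right, bottom)` box layout and the set of hyphen characters are assumptions. Note that the second function returns candidates only, because not every hyphen at the end of a text line is a HYP.

```python
from typing import Dict, List, Tuple

# Assumed box layout: (left, top, right, bottom) in page pixel coordinates.
Box = Tuple[int, int, int, int]

# Assumed set of characters that may mark an end-of-line hyphenation.
HYPHEN_CHARS = ("-", "\u2010")


def approximate_margins(page_width: int, page_height: int,
                        boxes: List[Box]) -> Dict[str, int]:
    """Approximate margin sizes from the union of all region bounding boxes."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[2] for b in boxes)
    bottom = max(b[3] for b in boxes)
    # Keys follow the ALTO margin element names; values are widths/heights.
    return {
        "LeftMargin": left,
        "TopMargin": top,
        "RightMargin": page_width - right,
        "BottomMargin": page_height - bottom,
    }


def hyphen_candidates(lines: List[str]) -> List[int]:
    """Return indices of lines that end in a hyphen and have a following line."""
    # The last line is skipped: there is no next line to continue the word.
    return [i for i, line in enumerate(lines[:-1])
            if line.rstrip().endswith(HYPHEN_CHARS)]
```

A candidate list like this would still need a second check (for example a dictionary lookup on the joined word) before a HYP element is written.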

chris1010010 avatar Nov 11 '19 17:11 chris1010010

FYI there is also ongoing work in the German OCR SIG to complete what Christian started, cf. https://github.com/maxnth/page-alto-ressources and https://github.com/maxnth/prima-core-libs/branches

cneud avatar Feb 14 '20 14:02 cneud

As per the 2021-04-29 Board Meeting, I am linking the ocrd-page-to-alto TODO list here, which gives a nice summary of missing equivalencies. Kudos to everyone who has worked on this.

artunit avatar May 05 '21 17:05 artunit