ocr-fileformat

TEI support?

Open zuphilip opened this issue 8 years ago • 14 comments

Question from the workshop: Can we also add transformation to/from TEI?

My first impression was that TEI is normally used in different applications. But I learned that it is also possible to add x-y coordinates of boxes in TEI. I haven't looked deeper into whether this is a suitable feature request...

I found an ALTO2TEI XSLT here: https://github.com/collex/typewright/blob/master/lib/saxon/AltoToTeiA.xsl (some fields are hardcoded for this project, and they mention another style sheet on which they based theirs).

zuphilip avatar May 14 '16 10:05 zuphilip

Also http://able.myspecies.info/abbyy-xml-tei-xml (looks a little special at first glance...)

zuphilip avatar May 14 '16 11:05 zuphilip

TEI is quite a big standard, lots of different flavors, so there are probably a lot of ways to implement it.

kba avatar May 14 '16 17:05 kba

It depends on what you want to achieve. If the primary goal is the transformation from TEI to ALTO for use in the DFG Viewer, that reduces the complexity a lot because much of the data can simply be ignored.

stweil avatar May 14 '16 17:05 stweil

We don't have any use case for this at the moment. Maybe we can just leave the issue open here and collect more information and any existing implementations whose code we could reuse. BTW, I don't think the technical implementation would be difficult; the real effort lies in reading and understanding the format descriptions and in testing with good examples.

There are a lot of transformation tools for TEI here: https://github.com/TEIC/Stylesheets but neither ALTO nor ABBYY is among them.

zuphilip avatar May 14 '16 17:05 zuphilip

Yes, let's keep this open and target the DFG Viewer; that seems feasible.

kba avatar May 15 '16 08:05 kba

Here's another ALTO to TEI XSL: https://github.com/emory-libraries/readux/blob/master/readux/books/ocr_to_teifacsimile.xsl

kba avatar May 16 '16 17:05 kba

See also this service which can convert various formats including ALTO to TEI: https://github.com/INL/OpenConvert

cneud avatar May 18 '16 20:05 cneud

Of interest: https://github.com/TEIC/Hackathon/blob/master/DH2015/xsl/hocr2tei.xsl

kba avatar Jun 13 '16 20:06 kba

PAGE2TEI https://github.com/dariok/page2tei

kba avatar Jun 17 '19 15:06 kba

Thank you @kba, that looks interesting as well! Let me know if anyone wants to work on integrating any of these transformations into ocr-fileformat.

zuphilip avatar Jun 17 '19 18:06 zuphilip

"We don't have any use case for this at the moment."

Now we have a use case. We must convert 64833 TEI files (like this one) to ALTO for Kitodo Presentation / DFG Viewer.

stweil avatar Sep 05 '19 07:09 stweil

A first attempt at writing an XSLT can be found here. Although it produces valid hOCR, the subsequent transformation to ALTO fails (most likely because the hOCR file lacks ocr_line elements). I guess it would be possible to extract the document's line structure from jumps in the top-left coordinate of the words in a paragraph, but I don't see an easy way to do this in XSLT. So maybe there will be a Python script eventually...
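
A minimal sketch of that line-grouping idea in Python, not the script that may eventually be written. It assumes the words of one paragraph are already available in reading order with pixel bounding boxes, in a single column of left-to-right text. The `Word` tuple, the function names and the `y_tolerance` parameter are all hypothetical, and reading the coordinates out of the TEI file is not shown.

```python
from typing import List, NamedTuple, Tuple


class Word(NamedTuple):
    """One word with its bounding box (hypothetical structure)."""
    text: str
    x0: int  # left
    y0: int  # top
    x1: int  # right
    y1: int  # bottom


def group_into_lines(words: List[Word], y_tolerance: float = 0.5) -> List[List[Word]]:
    """Split words (given in reading order) into lines.

    A new line starts when the top edge differs from the previous word's
    top edge by more than y_tolerance times the previous word's height,
    or when the left edge moves back to the left of the previous word.
    """
    lines: List[List[Word]] = []
    for word in words:
        same_line = False
        if lines:
            prev = lines[-1][-1]
            threshold = y_tolerance * (prev.y1 - prev.y0)
            same_line = abs(word.y0 - prev.y0) <= threshold and word.x0 >= prev.x0
        if same_line:
            lines[-1].append(word)
        else:
            lines.append([word])
    return lines


def line_bbox(line: List[Word]) -> Tuple[int, int, int, int]:
    """Bounding box (x0, y0, x1, y1) enclosing all words of a line."""
    return (
        min(w.x0 for w in line),
        min(w.y0 for w in line),
        max(w.x1 for w in line),
        max(w.y1 for w in line),
    )
```

The box returned by `line_bbox` has the same four values that an hOCR `ocr_line` element carries in its `bbox` property, so each group could be wrapped in such an element before the conversion to ALTO. The tolerance is a guess and would need tuning against real files.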

jmechnich avatar Sep 05 '19 08:09 jmechnich

Nice! @jmechnich Can you create a PR? Then it is easier to discuss this further. But I am quite happy with such an XSLT transformation, even if there are no ocr_lines (AFAIK they are also missing in your TEI file).

zuphilip avatar Sep 06 '19 10:09 zuphilip

Several years later... 😏

Hi all, is this still an open issue, given that the PR has been merged without further discussion?

jmechnich avatar Apr 10 '24 20:04 jmechnich