Possible to ingest XML/TEI files as texts?

Open rhoadsstevens opened this issue 4 years ago • 10 comments

I am wondering if, in the future, it might be possible to ingest XML files encoded with TEI (Text Encoding Initiative) tags as Texts into Manifold. At the University of Washington, there are some professors in the Humanities who teach TEI, and they are looking for ways to make digital editions of that work. In the next year, there might also be a Digital Textual Studies program at the UW, and in that, too, TEI would figure prominently.

Sometimes, in these classes, once students have their XML/TEI, they learn about XSLT and HTML and CSS in order to make digital editions, but it would be cool if Manifold could do all the styling on the back end and make simple, clean, nice pages out of XML/TEI.

Other times, the students don't learn about XSLT, HTML, and CSS, and, instead, they use sites like these to get a really quick look at what their TEI could look like as a webpage:

https://teipublisher.com/index.html
http://www.tapasproject.org/

So it would be very exciting if XML/TEI could be a text to ingest in Manifold.

rhoadsstevens avatar Dec 09 '20 18:12 rhoadsstevens

This has come up a number of times, so there's some demand for this feature, although it's not currently on our roadmap.

One way to come at this would be to leverage an existing library or tool for transforming TEI XML files into HTML+CSS. Our ingestion system is built around strategies, and strategies can leverage existing tools to transform the ingested document into HTML. Once we have HTML, we can utilize the existing ingestion pipeline.

I personally don't have a lot of experience with TEI. If there's someone out there who could help us understand established tools for transforming the XML to HTML, that would give us a head start on this.

zdavis avatar Dec 09 '20 18:12 zdavis

Hi, Zach! This is Elliott at the UW. I am new to this TEI/XML stuff, too, but one way that people convert their XML is with the software Oxygen. In Oxygen, you can then make your own XSLT stylesheets to convert things, but if you're not in Oxygen, then I think similar stylesheets live here:

https://oxgarage2.tei-c.org/

And that TEI Publisher site above has this playground that people use to see what their TEI/XML looks like:

https://teipublisher.com/exist/apps/tei-publisher/index.html?tab=0&collection=playground

(To be a default user on that site, you use "tei-demo" as the username and "demo" as the password.)

Does any of this help? If I'm not making sense I can share this information with one of the professors who teaches TEI and who works with students to make digital editions. Apparently, it would be an absolute dream if a platform like Manifold made this process easy for people who work with TEI.

rhoadsstevens avatar Dec 09 '20 18:12 rhoadsstevens

Hi Elliott!

This should be pretty doable. I haven't worked with XSLT/XML for many years, so I need to refresh my memory on how these transformations work.

It looks like there are XSLT stylesheets to go from TEI to EPUB. We already have the ability to chain ingestion strategies in Manifold. For example, to ingest a Word doc, we use Pandoc to turn Word into HTML, then we just use the Manifold HTML ingestion strategy. We take the same approach for Markdown.

In theory, we could write a simple converter class in Ruby to turn the XML into an EPUB (using, for example: https://github.com/TEIC/Stylesheets/blob/dev/epub3/tei-to-epub3.xsl), and then ingest the EPUB into Manifold using the existing EPUB strategy.
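A minimal sketch of the chaining idea, in Python rather than Ruby and purely for illustration. The registry, the function names, and the string-wrapping "converters" are all hypothetical stand-ins and are not Manifold's actual strategy classes; a real converter would call Pandoc or run an XSLT transformation. It also assumes, as a simplification, that every chain ends in HTML.

```python
from typing import Callable, Dict, List, Tuple

Converter = Callable[[str], str]


def tei_to_epub(doc: str) -> str:
    # Hypothetical placeholder for an XSLT step (e.g. a TEI-to-EPUB stylesheet).
    return f"epub({doc})"


def epub_to_html(doc: str) -> str:
    # Hypothetical placeholder for an existing EPUB ingestion step.
    return f"html({doc})"


def docx_to_html(doc: str) -> str:
    # Hypothetical placeholder for a Pandoc call.
    return f"html({doc})"


# Assumed registry: source format -> (target format, converter).
CONVERTERS: Dict[str, Tuple[str, Converter]] = {
    "tei": ("epub", tei_to_epub),
    "epub": ("html", epub_to_html),
    "docx": ("html", docx_to_html),
}


def ingest(doc: str, fmt: str) -> Tuple[str, List[str]]:
    """Apply converters until the document is HTML.

    Returns the converted document and the list of formats it passed through.
    """
    path = [fmt]
    while fmt != "html":
        if fmt not in CONVERTERS:
            raise ValueError(f"no ingestion strategy for {fmt!r}")
        fmt, convert = CONVERTERS[fmt]
        if fmt in path:
            raise ValueError("converter cycle detected")
        doc = convert(doc)
        path.append(fmt)
    return doc, path
```

With this shape, supporting TEI only means registering one new converter; everything downstream of EPUB is reused unchanged.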

It's possible that supporting this would be really kind of trivial, and it would be a super cool feature. I'll try to carve out some time for a proof of concept on it.

zdavis avatar Dec 09 '20 18:12 zdavis

Yay! Amazing news. I hope this works out, and I'll share this information with the professor who teaches this stuff. He might have some input along with elation.

rhoadsstevens avatar Dec 09 '20 18:12 rhoadsstevens

I had no idea what I was doing, but a couple of weeks ago, in the crudest way, I tried converting some XML in Oxygen into an EPUB. Then I uploaded that EPUB into a Manifold test book I had and came up with this:

https://uw.manifoldapp.org/projects/test-for-the-anthropocene-backpack

(The XML-to-EPUB part is on the main page and in the "Digital Editions Test" text category.) In that weird EPUB, I also tried to annotate its first line with my XML file. I couldn't ingest XML as a Text, so I brought it in as a Resource.

It's all a mess, but the whole idea is to start with XML/TEI and then make a cool digital edition with Manifold, all while still finding a way to make the XML available to readers if they want it.

rhoadsstevens avatar Dec 09 '20 19:12 rhoadsstevens

I shared this information with the professor I mentioned, and this is what he said:

"This is great. Thanks. Yes, it’s the XSLT stylesheet which will transform the XML into HTML. And given that the HTML tags Manifold is looking for are pretty limited, it could be a fairly trivial stylesheet, I suspect. I don’t know that he should mess with the stylesheets in OxGarage or Oxygen, which are very complicated.

I think you’d want another page for TEI, like the ones for Word and HTML, with a set of simple instructions to follow for marking up the text: with divs and heads and p's probably. Plus instructions for @rend, etc… Then just have a simple XSLT stylesheet which would turn those limited tags into HTML. Or EPUB, but I don’t know how to work with EPUB."
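To make the "limited tags" idea concrete, here is a rough sketch that uses Python's standard `xml.etree.ElementTree` in place of XSLT, purely for illustration. The tag mapping (`div` to `section`, `head` to `h2`) and the `@rend` values it recognizes are assumptions, not rules Manifold or the TEI Guidelines prescribe; only the TEI namespace URI is standard.

```python
import xml.etree.ElementTree as ET
from html import escape

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Assumed mapping of a small TEI subset to HTML (illustrative only).
TAG_MAP = {"div": "section", "head": "h2", "p": "p"}
REND_MAP = {"italic": "em", "bold": "strong"}


def convert(el):
    """Recursively render one TEI element as an HTML string."""
    name = el.tag.replace(TEI_NS, "")
    if name == "hi":
        # <hi rend="..."> becomes an inline tag; unknown values fall back to span.
        tag = REND_MAP.get(el.get("rend", ""), "span")
    else:
        tag = TAG_MAP.get(name)
    inner = escape(el.text or "")
    for child in el:
        inner += convert(child) + escape(child.tail or "")
    # Unmapped elements are unwrapped: content is kept, the tag is dropped.
    return f"<{tag}>{inner}</{tag}>" if tag else inner


def tei_to_html(tei_xml):
    """Convert the <body> of a TEI document to an HTML fragment."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{TEI_NS}body")
    if body is None:
        raise ValueError("no <body> element found")
    return convert(body)


SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>'
    '<head>Title</head><p>Some <hi rend="italic">text</hi> here.</p>'
    "</div></body></text></TEI>"
)

if __name__ == "__main__":
    print(tei_to_html(SAMPLE))
```

An XSLT stylesheet covering the same subset would consist of one template per mapped tag; the point is only that a narrow, documented tag set keeps the transformation small.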

rhoadsstevens avatar Dec 09 '20 19:12 rhoadsstevens

He also followed up with this:

"Just to clarify, because I think what I wrote here is confusing: you’d want a 6th page here: https://manifoldapp.org/docs/projects/preparing/index, with TEI as a 6th type of document Manifold can ingest. Then like with Word, you’d have pretty narrow instructions for exactly how to mark-up the text with a limited set of TEI tags (it would have to be a fairly limited use of TEI). Then you’d want a stylesheet designed to turn those tags into HTML."

rhoadsstevens avatar Dec 09 '20 19:12 rhoadsstevens

I don't do a ton of work with TEI, but for the last digital edition project I worked on I ended up using CETEIcean to cut out XSLT entirely. I'm way more comfortable with a JavaScript/HTML/CSS solution than XSLT, and I think that's probably the case for most developers, though perhaps not for editors of such editions. In any case, this might be another strategy to look at.

ColeDCrawford avatar Dec 09 '20 21:12 ColeDCrawford

That CETEIcean project is interesting, but I don't think it will help us here because it's built around web components. What we'll need is a way to transform the XML document to HTML that's supported in Manifold's reader.

Right now, I'm leaning toward incorporating something like Saxon to handle the initial XML transformation. Nokogiri, the Ruby XML library that we're currently using for a lot of XML/HTML processing in Manifold, doesn't have great support for the more modern XPath selectors used in the stock TEI stylesheets. Manifold already needs a Java runtime on the server for Elasticsearch and for the EPUB validator that we use, so using a Java-based XML transformation engine doesn't add more low-level dependencies to Manifold.
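For reference, a rough sketch of what shelling out to Saxon could look like, written in Python for illustration rather than as Manifold code. The `-s:`, `-xsl:`, and `-o:` options are Saxon's standard command-line flags; the function names and all file paths are hypothetical placeholders, and the stock TEI stylesheets may need additional parameters that are not shown here.

```python
import subprocess
from pathlib import Path


def build_saxon_command(saxon_jar, source, stylesheet, output):
    """Build the argv list for Saxon's command-line XSLT interface."""
    return [
        "java", "-jar", str(saxon_jar),
        f"-s:{source}",        # source XML document
        f"-xsl:{stylesheet}",  # XSLT stylesheet to apply
        f"-o:{output}",        # where to write the result
    ]


def transform(saxon_jar, source, stylesheet, output):
    """Run the transformation; raises CalledProcessError if Saxon fails."""
    for required in (saxon_jar, source, stylesheet):
        if not Path(required).is_file():
            raise FileNotFoundError(required)
    subprocess.run(
        build_saxon_command(saxon_jar, source, stylesheet, output),
        check=True,
    )
```

Keeping the command construction separate from the subprocess call makes the invocation easy to inspect and test without a Java runtime installed.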

Elliott, could the professor you're working with share a sample TEI document with us that is representative of the types of work he's considering publishing on Manifold? I've done a proof of concept with EEBO texts, but they have some idiosyncrasies, so I'd like a better test case.

zdavis avatar Dec 10 '20 14:12 zdavis

Hi, all. I wasn't able to attach TEI/XML files to this note, so I emailed them to you, Zach, and Cc'ed the professor I've been working with. He also wrote you a message to give you a bit more context.

rhoadsstevens avatar Dec 11 '20 19:12 rhoadsstevens