readtext

Support the TEI format

Open koheiw opened this issue 6 years ago • 2 comments

I recently learned that the TEI XML format is becoming popular in the linguistics community. In this format, texts are saved in small chunks with associated meta information (e.g. speaker) and, sometimes, POS tags.
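To make the "small chunks with meta information" idea concrete, here is a rough sketch of what reading such a file could involve. It is written in Python purely for illustration and is not readtext code. The sample document is made up and heavily simplified (a real TEI file would also carry a `teiHeader`), and the function name `read_tei_speeches` is hypothetical. It assumes drama-style markup in which each `<sp>` element holds one speech and its `who` attribute identifies the speaker.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Made-up, simplified sample: real TEI files also include a <teiHeader>.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <sp who="#alice">
        <speaker>Alice</speaker>
        <p>Good morning.</p>
      </sp>
      <sp who="#bob">
        <speaker>Bob</speaker>
        <p>Good morning to you.</p>
        <p>Shall we begin?</p>
      </sp>
    </body>
  </text>
</TEI>"""


def read_tei_speeches(xml_string):
    """Return one record per <sp> chunk: speaker id plus joined paragraph text.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_string)
    rows = []
    for sp in root.iterfind(".//tei:sp", TEI_NS):
        # The who attribute is a pointer such as "#alice"; drop the leading "#".
        who = sp.get("who", "").lstrip("#")
        paragraphs = [
            "".join(p.itertext()).strip()
            for p in sp.iterfind("tei:p", TEI_NS)
        ]
        rows.append({"speaker": who, "text": " ".join(paragraphs)})
    return rows


speeches = read_tei_speeches(SAMPLE)
for row in speeches:
    print(row["speaker"], "|", row["text"])
```

Each record maps naturally onto one row of a data frame, with the speech as the text and the speaker as a document-level variable. Real corpora vary a lot in which elements they use, so the `sp`/`p` structure here is only one possible layout.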

See: https://tei-c.org/ https://tei-c.org/activities/projects/ https://dracor.org/

koheiw, Dec 10 '19 08:12

Great idea. There is a package called tei2r (https://github.com/michaelgavin/tei2r/tree/master/R), but it looks pretty inactive.

kbenoit, Dec 10 '19 09:12

This would be cool, not least because tools like GROBID allow you to parse out things like references, headers, and footers and save them as TEI XML. [I'm just starting to look into quanteda, so sorry if quanteda can do this natively already]

sdspieg, Dec 21 '20 21:12