Generating initial TEI from txt and tsv?

Open nljubesi opened this issue 3 years ago • 2 comments

Hi, I wondered whether there is already code available for generating a preliminary TEI if we have txt and tsv files already available in the proper format.

From what I understand, the generation direction should be the opposite (TEI->(txt,tsv)), but it seems that quite a lot of TEI could be generated, together with placeholders to be additionally filled out manually, from (txt, tsv).

Such a script might be useful to many people, I guess. Mentioning @5roop as he might have fun preparing such a script, if this is a valid idea. I could not find this discussion in either the open or the closed issues, but I might have missed something. Sorry if that is the case.

nljubesi, Sep 26 '22 06:09

No, there is no such script, I think partly because most people started from HTML or XML, not plain text. Still, it might be a good thing to have, but it might be more difficult than it seems at first glance:

  • I don't think persons and organisations can be easily modelled as TSV
  • The metadata about a corpus component could be stored in TSV
  • The utterances (which is what you probably meant) could be modelled: utterance metadata = tsv, utterance text = txt
  • A problem with the utterances is the transcriber comments: you would need to introduce some special symbols to mark them.
  • And, I guess you wouldn't then use any "extra" markup, such as page breaks.

There might be other details to consider, once somebody gets to grips with the data. This is now off the top of my head, pending a volunteer to be found...
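
To make the utterance case above concrete, here is a minimal sketch in Python of how a TSV of utterance metadata and a plain-text file could be paired into `<u>` elements. Everything about the input format is an assumption made for illustration: the TSV columns `id`, `who` and `role`, one utterance per text line in the same order as the TSV rows, and `[[...]]` as the special symbol for transcriber comments. It does not attempt the header, persons or organisations, and the output would still need to be checked against the ParlaMint schema.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Hypothetical convention: transcriber comments appear as [[...]] in the text.
COMMENT = re.compile(r"\[\[(.*?)\]\]")


def utterance(meta, text):
    """Build one <u> element from a TSV row (a dict) and its line of text."""
    attrs = {XML_ID: meta["id"], "who": "#" + meta["who"], "ana": "#" + meta["role"]}
    u = ET.Element("u", attrs)
    seg = ET.SubElement(u, "seg")
    # Splitting on a pattern with one capture group alternates speech and comment:
    # even indices hold speech, odd indices hold comment text.
    parts = COMMENT.split(text)
    seg.text = parts[0]
    for i in range(1, len(parts), 2):
        note = ET.SubElement(seg, "note")
        note.text = parts[i].strip()
        note.tail = parts[i + 1]  # speech that follows the comment
    return u


def build_div(tsv_file, txt_file):
    """Pair TSV rows with text lines by position and wrap them in a <div>."""
    div = ET.Element("div", {"type": "debateSection"})
    rows = csv.DictReader(tsv_file, delimiter="\t")
    for meta, line in zip(rows, txt_file):
        div.append(utterance(meta, line.rstrip("\n")))
    return div


# Tiny in-memory example standing in for real .tsv and .txt files.
tsv = io.StringIO("id\twho\trole\nu1\tSpeakerA\tchair\nu2\tSpeakerB\tregular\n")
txt = io.StringIO("The session is open. [[applause]] Let us begin.\nThank you.\n")
div = build_div(tsv, txt)
print(ET.tostring(div, encoding="unicode"))
```

Pairing by line position keeps the sketch short but is fragile; a real script would more likely key the text on the utterance id so that a missing line cannot silently shift every later utterance.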

TomazErjavec, Sep 26 '22 09:09

Just in case it is helpful: I have a script to convert CoNLL-U files to ParlaMint v1 at https://github.com/coltekin/ParlaMint-TR/blob/main/tomint.py. It assumes that quite a lot of information is provided as specific comments in the CoNLL-U files, plus a TSV file for speaker information. The version for v2 is in progress; I plan to push it to this repo once the output passes validation (hopefully in a few days).

coltekin, Sep 26 '22 10:09