ParlaMint

Encoding USAS in TEI

Open TomazErjavec opened this issue 1 year ago • 1 comments

This issue discusses the unresolved problems from #202 and #204. The current encoding of USAS in TEI is given in the guidelines and is arguably OK, even though other possibilities exist (in particular stand-off markup, which avoids the problem of crossing XML tags but makes resolving the references more complicated). Also, it is not yet clear whether retaining per-word USAS tags is sensible in the context of MWEs. These dilemmas should be resolved here.

The conversion of CoNLL-U with USAS tags into TEI is done by the conllu2tei.pl script. This script is badly written: it first just inserts <name> and <phr> into a temporary TEI and only afterwards tries to resolve conflicts, but does so badly, i.e. it removes <phr> elements even in cases where it shouldn't, in particular phr/name, (arguably) name/phr, and when a phr is adjacent to a name, which is a definite bug. Again, how to improve the script should be discussed here.
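To make the cases above concrete, here is a minimal sketch of how conflicts could be classified before anything is removed. It is not how conllu2tei.pl works. It is written in Python for brevity, and the `Span` type, the `relation` and `resolve` functions, and the inclusive token-index representation are all hypothetical. It also assumes that on a true crossing the <name> wins and the <phr> is dropped, which is itself open for discussion.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    tag: str    # "name" or "phr"
    start: int  # index of the first token, inclusive
    end: int    # index of the last token, inclusive


def relation(a: Span, b: Span) -> str:
    """Classify how span a relates to span b."""
    if a.end < b.start or b.end < a.start:
        return "disjoint"   # no shared token; this includes adjacent spans
    if a.start <= b.start and b.end <= a.end:
        return "contains"   # b nests inside a (or the spans are identical)
    if b.start <= a.start and a.end <= b.end:
        return "inside"     # a nests inside b
    return "crossing"       # partial overlap: cannot be nested in XML


def resolve(names: List[Span], phrs: List[Span]) -> List[Span]:
    """Keep every phr that nests with or is disjoint from all names.

    Assumption: only a phr that crosses a name is dropped.
    """
    return [p for p in phrs
            if all(relation(p, n) != "crossing" for n in names)]
```

With a name over tokens 2-4, a phr over 3-4 (name/phr), a phr over 1-5 (phr/name), and a phr over 5-6 (adjacent) would all be kept, while a phr over 4-5 would be dropped because it crosses the name boundary.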

TomazErjavec · Nov 12 '23 10:11