ParlaMint
Encoding USAS in TEI
This issue discusses the unresolved problems from #202 and #204. The current encoding of USAS in TEI is given in the guidelines and is arguably OK, even though other possibilities exist (in particular stand-off markup, which has no problems with crossing XML tags but where resolving the annotations then gets complicated). Also, it is not yet clear whether retaining per-word USAS tags is sensible in the context of MWEs. These dilemmas should be resolved here.
The conversion of CoNLL-U with USAS tags into TEI is done by the conllu2tei.pl script. This script is badly written: it first just inserts <name> and <phr> into a temporary TEI and only afterwards tries to resolve conflicts, and it does so badly, i.e. it removes <phr> elements even in cases where it shouldn't, in particular phr/name, (arguably) name/phr, and when a phr is adjacent to a name, which is a definite bug. Again, how to make the script better should be discussed here.
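As a starting point for that discussion, here is a minimal sketch of one possible conflict rule: decide on token spans before any XML is written, and drop a phr only when it truly crosses a name. It is written in Python for brevity and is not taken from conllu2tei.pl; `Span`, `crosses` and `resolve` are hypothetical names, and dropping the phr (rather than the name) in a crossing case is an assumption that mirrors what the script appears to do now.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """A hypothetical annotation span over token indices."""
    kind: str   # "name" or "phr"
    start: int  # index of the first token (inclusive)
    end: int    # index one past the last token (exclusive)


def crosses(a: Span, b: Span) -> bool:
    """True only if a and b overlap partially, i.e. neither contains the other."""
    overlap = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


def resolve(names: List[Span], phrs: List[Span]) -> List[Span]:
    """Keep every phr that nests with, or is merely adjacent to, all names.

    Assumption: in a real crossing the phr is the one that gets dropped.
    """
    return [p for p in phrs if not any(crosses(p, n) for n in names)]


if __name__ == "__main__":
    names = [Span("name", 2, 4)]
    phrs = [
        Span("phr", 1, 5),  # phr/name: phr contains the name
        Span("phr", 2, 3),  # name/phr: phr is inside the name
        Span("phr", 4, 6),  # adjacent: starts where the name ends
        Span("phr", 3, 5),  # crossing: partial overlap with the name
    ]
    for p in resolve(names, phrs):
        print(p)
```

With this rule the three cases listed above (phr/name, name/phr, adjacent) all survive, and only the partially overlapping phr is removed. Whether name/phr nesting should be kept is still the open question noted above; excluding it would only need an extra check in `resolve`.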