CoNLL-U plus
Hi!
I cannot find this in the documentation: does Udapi already include ways to handle CoNLL-U Plus files (reading, writing, ...)? In particular, I am interested in expanding an existing regular CoNLL-U file into a Plus one by adding new custom columns.
Thanks!
Udapi does not support CoNLL-U Plus yet. There is read.Conll with parameter attributes, where you can specify which columns are in a given file, but it uses setattr(node, attribute_name, value) internally, which means that only existing attribute names can be used as column names (or an underscore meaning that a given column should be ignored by the reader).
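To illustrate the limitation, here is a toy sketch of a setattr-based column reader. ToyNode and fill_node are hypothetical stand-ins, not Udapi's actual classes, and the attribute names (ord, form, lemma, misc) are assumed for the example. The sketch also assumes the node class has a fixed attribute set (modelled here with __slots__), so an unknown column name fails:

```python
class ToyNode:
    """Toy stand-in for a node with a fixed attribute set (NOT Udapi's Node)."""
    __slots__ = ("ord", "form", "lemma", "misc")


def fill_node(node, attributes, values):
    """Assign each column value to the attribute of the same name."""
    for name, value in zip(attributes, values):
        if name == "_":
            # An underscore means: ignore this column.
            continue
        # Raises AttributeError if `name` is not a known attribute.
        setattr(node, name, value)
    return node
```

With this toy class, fill_node(ToyNode(), ["ord", "form", "_"], ["1", "dogs", "x"]) works, whereas a column named "parseme:mwe" raises AttributeError.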
I would welcome a PR adding read.Conlluplus (or read.Conllup, considering that .conllup is the recommended file extension) and write.Conlluplus. That would mean interpreting the global.columns header (perhaps storing it in document.meta['global.columns'], similarly to document.meta['global.Entity']). The open question is where to store the extra (non-standard) columns and how to name them (lowercase?). I would suggest storing them in node.misc, so that e.g. global.columns = ID FORM PARSEME:MWE results in node.misc["parseme:mwe"] containing the values from the last column. When serializing this document using write.Conlluplus with document.meta['global.columns'] == "ID FORM MISC PARSEME:MWE", the parseme:mwe attribute would be written not to MISC but to the last (PARSEME:MWE) column. This would allow users to convert easily between different formats, e.g. with udapy read.Conllu files=input.conllu util.Eval doc='doc.meta["global.columns"] = "ID FORM LEMMA MISC PARSEME:MWE"' write.Conlluplus files=output.conllup (the inner quotes must differ from the outer single quotes so that the shell passes the expression through intact).
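A minimal, library-independent sketch of the proposed mapping may help whoever picks this up. It does not use Udapi at all: read_conllup, write_conllup and the dict-based token representation are hypothetical names chosen for illustration, and sentence boundaries and other comment lines are ignored for brevity. Extra columns are stored under lowercase keys in a per-token misc dict, and on output they go to their own column if it is listed in global.columns, otherwise they stay in MISC:

```python
# Hypothetical sketch, not part of Udapi; all names here are illustrative.
STANDARD = {"ID", "FORM", "LEMMA", "UPOS", "XPOS",
            "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"}


def read_conllup(text):
    """Return (columns, tokens); each token is a dict with a 'misc' dict."""
    columns, tokens = None, []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
        elif line and not line.startswith("#"):
            if columns is None:
                raise ValueError("missing global.columns header")
            token = {"misc": {}}
            for name, value in zip(columns, line.split("\t")):
                if name == "MISC":
                    if value != "_":
                        for item in value.split("|"):
                            key, _, val = item.partition("=")
                            token["misc"][key] = val
                elif name in STANDARD:
                    token[name.lower()] = value
                elif value != "_":
                    # Non-standard column: keep it in misc, lowercased.
                    token["misc"][name.lower()] = value
            tokens.append(token)
    return columns, tokens


def write_conllup(columns, tokens):
    """Serialize tokens using the given column list."""
    # misc keys that have a dedicated column must not be repeated in MISC.
    extra = {c.lower() for c in columns if c not in STANDARD}
    lines = ["# global.columns = " + " ".join(columns)]
    for token in tokens:
        fields = []
        for name in columns:
            if name == "MISC":
                rest = [f"{k}={v}" for k, v in token["misc"].items()
                        if k not in extra]
                fields.append("|".join(rest) or "_")
            elif name in STANDARD:
                fields.append(token.get(name.lower(), "_"))
            else:
                fields.append(token["misc"].get(name.lower(), "_"))
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"
```

Reading a file with columns ID FORM PARSEME:MWE and writing it back with columns ID FORM MISC PARSEME:MWE leaves MISC empty and puts the value in the last column; writing with columns ID FORM MISC instead keeps it as parseme:mwe=... inside MISC.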