Handle CoNLL-U comments

Open DavidNemeskey opened this issue 4 years ago • 3 comments

emtsv does not handle CoNLL-U comments very well. If the input is a TSV file, one of two things happens:

  1. If the file only has the form column, comments (lines starting with "# ") are treated as tokens: each comment line is analyzed as a single "word".
  2. If the file has other columns as well (e.g. form anas lemma xpostag, to which I want to add upostag feats), only the new header is returned and the text is not analyzed at all.

Expected behavior: comments should be kept in the text and returned as-is, and they should not prevent emtsv from analyzing the text (as happens in the second case).
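
To make the expected behavior concrete, here is a minimal sketch in Python. It is not how emtsv or xtsv actually reads its input: `annotate`, `toy_lemmatizer` and the `new_columns` argument are hypothetical names made up for illustration, and the sketch assumes that comments start with "# ", that sentences are separated by empty lines, and that comments only appear before a sentence.

```python
from typing import Callable, Iterable, List

Row = List[str]


def annotate(lines: Iterable[str],
             analyze_sentence: Callable[[List[Row]], List[Row]],
             new_columns: List[str]) -> List[str]:
    """Hypothetical TSV pass: analyze token rows, return comments unchanged."""
    rows = iter(lines)
    # The first line is the header; the analyzer's new columns are appended to it.
    header = next(rows).rstrip("\n").split("\t")
    out = ["\t".join(header + list(new_columns))]
    sentence: List[Row] = []

    def flush() -> None:
        # Analyze the buffered sentence (if any) and emit its rows.
        if sentence:
            out.extend("\t".join(row) for row in analyze_sentence(sentence))
            sentence.clear()

    for raw in rows:
        line = raw.rstrip("\n")
        if line.startswith("# "):
            # Comment line: never tokenized, returned as-is.
            # Assumption: comments precede a sentence, so the buffer is
            # normally empty here; flushing only preserves the line order.
            flush()
            out.append(line)
        elif line == "":
            # Empty line: end of sentence.
            flush()
            out.append("")
        else:
            sentence.append(line.split("\t"))
    flush()
    return out


def toy_lemmatizer(sentence: List[Row]) -> List[Row]:
    """Stand-in for a real module: adds a lowercased 'lemma' column."""
    return [row + [row[0].lower()] for row in sentence]
```

With this, a form-only input containing a "# text = ..." line would come back with the comment untouched and with the new column filled in for the real tokens only.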

DavidNemeskey avatar Aug 19 '21 09:08 DavidNemeskey

CoNLL-U comments need to be explicitly enabled with the conllu-comments parameter. We may flip the default behaviour to enabled in some future release.

I agree that the documentation is very sparse on this.

dlazesz avatar Aug 19 '21 10:08 dlazesz

Yes, I think it would make sense if that was the default. Should I do it in a PR (+ add a sentence about it to the docs)?

DavidNemeskey avatar Aug 19 '21 11:08 DavidNemeskey

Specifying this in the docs is OK, but changing the default in xtsv requires a new major version, at least of xtsv. Such breaking changes should be committed in batches to minimise disruption. (We have others in mind.)

@mittelholcz What do you think?

dlazesz avatar Aug 23 '21 08:08 dlazesz