uni-dep-tb
Problem in annotation consistency for French
What steps will reproduce the problem?
1. cat fr-universal-test.conll|tr "\t" " "|cut -d" " -f2,4|grep "%"|sort|uniq -c
2. cat fr-universal-train.conll|tr "\t" " "|cut -d" " -f2,4|grep "%"|sort|uniq -c
What is the expected output? What do you see instead?
What we get:
- for the test set:
12 % NOUN
- for the train set:
17 % NOUN
231 % X
There should either be a single label for all "%" tokens or, at least, a similar
label distribution in the test and train sets.
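For comparing both files in one pass, a minimal sketch is given here. It
assumes the files are tab-separated with the word form in column 2 and the POS
tag in column 4, as the cut fields in the commands above imply. The function
name tag_counts is hypothetical and not part of the treebank distribution.
Unlike grep "%", it matches the token exactly, so forms that merely contain
"%" are not counted.

```shell
# tag_counts TOKEN FILE...
# Prints "file<TAB>tag<TAB>count" for every POS tag (column 4) assigned to
# tokens whose form (column 2) is exactly TOKEN.
# Assumption: tab-separated input, form in column 2, tag in column 4.
tag_counts() {
  token=$1
  shift
  awk -F'\t' -v tok="$token" '
    $2 == tok { n[FILENAME "\t" $4]++ }          # count (file, tag) pairs
    END { for (k in n) print k "\t" n[k] }       # one line per pair
  ' "$@" | sort
}
```

It could be called as tag_counts '%' fr-universal-train.conll fr-universal-test.conll,
and the same call with 'Mr.' as the token would check the other case mentioned
in this report.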
What version of the product are you using? On what operating system?
universal_treebanks_v2.0.tar.gz
Please provide any additional information below.
There seem to be problems for at least "%", "Mr." and hours. There is also a
tokenization inconsistency for hours, which are sometimes tokenized (18_h_23)
and sometimes not (18h23).
The problem may also affect other languages: when we train a standard
supervised POS tagger, the error rate is always much larger (> 4%) than the one
reported on the datasets of the paper "A Universal Part-of-Speech Tagset".
Original issue reported on code.google.com by [email protected] on 10 Sep 2014 at 9:02