Problem in annotation consistency for French

Open GoogleCodeExporter opened this issue 10 years ago • 0 comments

What steps will reproduce the problem?
1. cat fr-universal-test.conll|tr "\t" " "|cut -d" " -f2,4|grep "%"|sort|uniq -c
2. cat fr-universal-train.conll|tr "\t" " "|cut -d" " -f2,4|grep "%"|sort|uniq -c

What is the expected output? What do you see instead?

What we get:
- for the test set:
   12 % NOUN
- for the train set:
   17 % NOUN
  231 % X

There should either be a single label for all "%" tokens or, at least, a 
similar distribution of labels in the test and train sets.
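
For anyone who wants to run the same check on other tokens, here is a rough Python sketch of the comparison. It assumes the tab-separated layout that the `cut -f2,4` commands above rely on (word form in column 2, universal tag in column 4). The function names `tag_counts` and `tag_distribution` are made up for this example. Unlike `grep "%"`, it only counts tokens that are exactly equal to the requested form.

```python
from collections import Counter


def tag_counts(lines, token):
    """Count the tags assigned to `token` in an iterable of CoNLL lines."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Blank lines separate sentences; skip them and any short row.
        if len(cols) < 4:
            continue
        # Assumption: column 2 is the word form and column 4 the tag
        # (1-based), as in the `cut -f2,4` commands of this report.
        if cols[1] == token:
            counts[cols[3]] += 1
    return counts


def tag_distribution(counts):
    """Turn raw tag counts into relative frequencies (empty if no hits)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}
```

Calling `tag_distribution(tag_counts(open(path, encoding="utf-8"), "%"))` on the train file and then on the test file gives two dictionaries that can be compared side by side. With the counts reported above, the train set would be mostly X for "%" while the test set is entirely NOUN.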

What version of the product are you using? On what operating system?

universal_treebanks_v2.0.tar.gz

Please provide any additional information below.

There seem to be problems for at least "%", "Mr." and hours. There is also a 
tokenization problem for hours, which are sometimes tokenized (18_h_23) and 
sometimes not (18h23).

The problem may also appear for other languages: when we train a standard 
supervised POS tagger, the error rate is always much larger (> 4%) than the one 
reported using the datasets of the paper "A Universal Part-of-Speech Tagset".

Original issue reported on code.google.com by [email protected] on 10 Sep 2014 at 9:02

Jun 10 '15 08:06 GoogleCodeExporter