UD_Portuguese-Bosque
Inconsistencies in DET lemmas
I've found some tokens whose form is "a" but which are lemmatized as "o". Is this correct?
Some examples:
# forest 1
# source CETEMPúblico n=9 sec=clt sem=95a
# sent_id CP9-1
# id 49
10 em em ADP <sam->|PRP|@<PIV _ 12 case _ _
11 a o DET <-sam>|<artd>|ART|@>N Definite=Def|PronType=Art 12 det _ _
12 história história NOUN <np-def>|N|F|S|@P< Gender=Fem|Number=Sing 8 nmod _ _
# forest 1
# source CETEMPúblico n=99 sec=des sem=93a
# sent_id CP99-3
# id 525
1 A o DET <artd>|ART|F|S|@>N Definite=Def|Gender=Fem|Number=Sing|PronType=Art 2 det _ _
2 dupla dupla NOUN <np-def>|N|F|S|@SUBJ> Gender=Fem|Number=Sing 8 nsubj _ _
I've found and counted 9598 examples of this type with awk '$2 ~ /\ya\y/ && $3 ~ /o/ { count++} END {print count}' *.conllu.
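For anyone who wants to cross-check the count, here is a minimal Python sketch of a stricter version of the same query. Two things about the awk one-liner above are worth keeping in mind: `\y` is a gawk-specific word-boundary operator, and `$3 ~ /o/` matches any lemma that merely contains an "o", so the total may include tokens other than the ones shown. The awk match is also case-sensitive, so a sentence-initial "A" like the one in CP99-3 would not be counted. The function name `count_form_lemma_pairs` and its parameters are my own (hypothetical), and the sample rows are the tokens quoted above, re-joined with tabs as CoNLL-U expects.

```python
def count_form_lemma_pairs(lines, form="a", lemma="o", upos="DET", ignore_case=True):
    """Count CoNLL-U tokens whose FORM, LEMMA and UPOS match exactly.

    Hypothetical helper for illustration; not part of any UD tooling.
    """
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A well-formed CoNLL-U token line has exactly 10 tab-separated columns.
        if len(cols) != 10:
            continue
        tok_form, tok_lemma, tok_upos = cols[1], cols[2], cols[3]
        if ignore_case:
            tok_form = tok_form.lower()
        if tok_form == form and tok_lemma == lemma and tok_upos == upos:
            count += 1
    return count


# The tokens quoted in this issue, with columns joined by tabs.
SAMPLE = [
    "# sent_id CP9-1",
    "\t".join(["10", "em", "em", "ADP", "<sam->|PRP|@<PIV", "_", "12", "case", "_", "_"]),
    "\t".join(["11", "a", "o", "DET", "<-sam>|<artd>|ART|@>N",
               "Definite=Def|PronType=Art", "12", "det", "_", "_"]),
    "\t".join(["12", "história", "história", "NOUN", "<np-def>|N|F|S|@P<",
               "Gender=Fem|Number=Sing", "8", "nmod", "_", "_"]),
    "",
    "# sent_id CP99-3",
    "\t".join(["1", "A", "o", "DET", "<artd>|ART|F|S|@>N",
               "Definite=Def|Gender=Fem|Number=Sing|PronType=Art", "2", "det", "_", "_"]),
    "\t".join(["2", "dupla", "dupla", "NOUN", "<np-def>|N|F|S|@SUBJ>",
               "Gender=Fem|Number=Sing", "8", "nsubj", "_", "_"]),
]

print(count_form_lemma_pairs(SAMPLE))                     # both "a" and "A"
print(count_form_lemma_pairs(SAMPLE, ignore_case=False))  # lowercase "a" only
```

To run it on the treebank itself, pass an open file instead of the sample list, for example `count_form_lemma_pairs(open(path, encoding="utf-8"))` for each `.conllu` file. Multiword-token lines are ignored automatically because their LEMMA and UPOS columns are `_`.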