
inconsistencies on DET lemmas

Open GPPassos opened this issue 8 years ago • 9 comments

I've found some tokens whose form is "a" but which are lemmatized as "o". Is this correct?

Some examples:

# forest 1
# source CETEMPúblico n=9 sec=clt sem=95a
# sent_id CP9-1
# id 49
10	em	em	ADP	<sam->|PRP|@<PIV	_	12	case	_	_
11	a	o	DET	<-sam>|<artd>|ART|@>N	Definite=Def|PronType=Art	12	det	_	_
12	história	história	NOUN	<np-def>|N|F|S|@P<	Gender=Fem|Number=Sing	8	nmod	_	_
# forest 1
# source CETEMPúblico n=99 sec=des sem=93a
# sent_id CP99-3
# id 525
1	A	o	DET	<artd>|ART|F|S|@>N	Definite=Def|Gender=Fem|Number=Sing|PronType=Art	2	det	_	_
2	dupla	dupla	NOUN	<np-def>|N|F|S|@SUBJ>	Gender=Fem|Number=Sing	8	nsubj	_	_ 

I counted 9598 examples of this type with awk '$2 ~ /\ya\y/ && $3 ~ /o/ { count++} END {print count}' *.conllu.
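Note that the awk pattern above is fairly loose: `$3 ~ /o/` matches any lemma that merely contains an "o", `\y` is a gawk-specific word-boundary operator, and the lowercase pattern skips sentence-initial "A". A stricter count could be sketched in Python as below. This is only a sketch: the function names `count_form_lemma` and `count_in_files` are hypothetical, and it assumes tab-separated CoNLL-U lines with FORM in column 2 and LEMMA in column 3.

```python
import glob


def count_form_lemma(lines, form="a", lemma="o"):
    """Count token lines whose FORM equals `form` (ignoring case)
    and whose LEMMA is exactly `lemma`."""
    count = 0
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment lines such as "# sent_id ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip anything that does not have at least ID, FORM and LEMMA columns.
        if len(cols) < 3:
            continue
        if cols[1].lower() == form and cols[2] == lemma:
            count += 1
    return count


def count_in_files(pattern="*.conllu"):
    """Sum the counts over every file matching `pattern` (an assumed glob)."""
    total = 0
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as handle:
            total += count_form_lemma(handle)
    return total
```

Calling `count_in_files()` from the directory that holds the treebank files would give the number of tokens with form "a" or "A" and lemma exactly "o". That number may differ from the awk figure, since the two patterns do not match the same set of lines.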

GPPassos, Jan 23 '17 11:01