Some taggers chop off bits of words
$ echo "P'edon war bont an Naoned" | apertium -d . br-fr-tagger
^pa<cnjsub>$ ^ ^war<pr>$ ^pont<n><m><sg>$ ^an<det><def><sp>$ ^Naoned<np><top><sg>$^.<sent>$
Is this a problem with lt-proc, the br dictionary, or something else?
It's a problem with the HMM tagger when a .prob file is used on a dictionary where it hasn't seen the MLUs defined in the .tsx file.
$ echo "P'edon" | lt-proc -w '/home/daniel/apertium-data/apertium-br-fr/br-fr.automorf.bin' | cg-proc '/home/daniel/apertium-data/apertium-br-fr/br-fr.rlx.bin'
^P'edon/pa<cnjsub>+bezañ<vbloc><pii><p1><sg><@+FMAINV>$
$ echo "P'edon" | lt-proc -w '/home/daniel/apertium-data/apertium-br-fr/br-fr.automorf.bin' | cg-proc '/home/daniel/apertium-data/apertium-br-fr/br-fr.rlx.bin' | apertium-tagger -g -d br-fr.prob
Warning: There is not coarse tag for the fine tag ''
This is because of an incomplete tagset definition or a dictionary error
^pa<cnjsub>+Error: A new ambiguity class was found.
Retraining the tagger is necessary so as to take it into account.
Word ''.
New ambiguity class: {TAG_kUNDEF,CNJSUBS}
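For reference, the analysis printed by cg-proc above is a single lexical unit with two parts joined by `+`. Here is a rough sketch in plain Python (not Apertium code, and `split_mlu` is a name made up for this illustration) of how that analysis breaks into parts. The `+` inside `<@+FMAINV>` belongs to a tag and must not be treated as a separator. Escaped characters are ignored for simplicity.

```python
def split_mlu(analysis: str) -> list[str]:
    """Split an analysis on '+' signs that sit outside <...> tags.

    Simplification: backslash-escaped characters are not handled.
    """
    parts = []
    current = []
    depth = 0  # > 0 while we are inside a <tag>
    for ch in analysis:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "+" and depth == 0:
            # A '+' outside any tag separates two parts of the unit.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


analysis = "pa<cnjsub>+bezañ<vbloc><pii><p1><sg><@+FMAINV>"
print(split_mlu(analysis))
# ['pa<cnjsub>', 'bezañ<vbloc><pii><p1><sg><@+FMAINV>']
```

The second part, with its `<vbloc><pii><p1><sg><@+FMAINV>` tags, is the one that goes missing from the tagger output.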
The specific issue is that <vbloc><pii><p1><sg><@+FMAINV> is not a set of tags that the br-fr tagger recognizes. When the tagger reads a compound and recognizes the first part but not the second, it outputs ^first+ and then stops printing that LU.
If the first part is the unrecognized one, or if the LU is not an MLU, there is no issue.
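To make the condition concrete, here is a small sketch of which units hit the bug, as I understand it from the behaviour described above. It is only an illustration: `is_affected`, `tags_of` and the `KNOWN` set are hypothetical, and the exact-string lookup stands in for the real matching against the tagset in the .tsx file. It also assumes that any unrecognized part after a recognized first part triggers the problem, which I have only observed for two-part units.

```python
def tags_of(part: str) -> str:
    """Return the tag string of one part, e.g. 'pa<cnjsub>' -> '<cnjsub>'."""
    start = part.find("<")
    return part[start:] if start != -1 else ""


def is_affected(parts: list[str], known_tags: set[str]) -> bool:
    """True if the unit would be cut off after '^first+'.

    parts: the '+'-separated parts of one lexical unit.
    known_tags: tag strings the tagger recognizes (hypothetical stand-in
    for the categories defined in the .tsx file).
    """
    if len(parts) < 2:
        return False  # not an MLU: no issue
    if tags_of(parts[0]) not in known_tags:
        return False  # first part unrecognized: no issue
    # First part recognized, some later part not: output gets truncated.
    return any(tags_of(p) not in known_tags for p in parts[1:])


# Hypothetical set of recognized tag strings, for illustration only.
KNOWN = {"<cnjsub>", "<pr>", "<n><m><sg>"}

print(is_affected(["pa<cnjsub>", "bezañ<vbloc><pii><p1><sg><@+FMAINV>"], KNOWN))  # True
print(is_affected(["war<pr>"], KNOWN))                                            # False
print(is_affected(["bezañ<vbloc><pii>", "pa<cnjsub>"], KNOWN))                    # False
```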
#113 fixes br-fr for me.