Some taggers chop off bits of words

Open ftyers opened this issue 4 years ago • 5 comments

$ echo "P'edon war bont an Naoned" | apertium -d . br-fr-tagger
^pa<cnjsub>$ ^ ^war<pr>$ ^pont<n><m><sg>$ ^an<det><def><sp>$ ^Naoned<np><top><sg>$^.<sent>$

ftyers, Dec 21 '20 09:12

Is this a problem with lt-proc, the br dictionary, or something else?

jonorthwash, Dec 22 '20 13:12

It's a problem with the HMM tagger. It shows up when a .prob file is used with a dictionary and the tagger hasn't seen the MLUs defined in the .tsx file.

ftyers, Dec 22 '20 14:12

echo "P'edon" | lt-proc -w '/home/daniel/apertium-data/apertium-br-fr/br-fr.automorf.bin' | cg-proc '/home/daniel/apertium-data/apertium-br-fr/br-fr.rlx.bin'
^P'edon/pa<cnjsub>+bezañ<vbloc><pii><p1><sg><@+FMAINV>$
echo "P'edon" | lt-proc -w '/home/daniel/apertium-data/apertium-br-fr/br-fr.automorf.bin' | cg-proc '/home/daniel/apertium-data/apertium-br-fr/br-fr.rlx.bin' | apertium-tagger -g -d br-fr.prob 
Warning: There is not coarse tag for the fine tag ''
         This is because of an incomplete tagset definition or a dictionary error
^pa<cnjsub>+Error: A new ambiguity class was found. 
Retraining the tagger is necessary so as to take it into account.
Word ''.
New ambiguity class: {TAG_kUNDEF,CNJSUBS}

The specific issue is that <vbloc><pii><p1><sg><@+FMAINV> is not a set of tags that the br-fr tagger recognizes. When the tagger reads a compound and recognizes the first part but not the second, it outputs ^first+ and then stops printing that LU.

mr-martian, Dec 22 '20 14:12

If it's the first part that is unrecognized or if it's not an MLU, there is no issue.
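
To check whether other language pairs are affected, one option is to scan tagger output for LUs that are opened but never closed. The following is only a rough sketch, not part of Apertium: the function name find_truncated_lus is made up for illustration, and it assumes the usual stream format in which each LU starts with ^, ends with $, and literal occurrences of those characters are escaped with a backslash.

```python
def find_truncated_lus(stream: str) -> list[str]:
    """Return fragments that start with '^' but are never closed by '$'.

    Hypothetical helper: assumes Apertium stream format, where an LU is
    delimited by '^' ... '$' and special characters are backslash-escaped.
    """
    truncated = []
    i, n = 0, len(stream)
    while i < n:
        if stream[i] == "\\":        # escaped character outside an LU: skip it
            i += 2
            continue
        if stream[i] == "^":
            start = i
            i += 1
            closed = False
            while i < n:
                if stream[i] == "\\":  # escaped character inside an LU
                    i += 2
                    continue
                if stream[i] == "$":   # LU properly terminated
                    closed = True
                    break
                if stream[i] == "^":   # next LU began before this one ended
                    break
                i += 1
            if not closed:
                truncated.append(stream[start:i])
                continue               # re-examine the '^' we stopped on, if any
        i += 1
    return truncated


if __name__ == "__main__":
    import sys
    # Read tagger output from stdin and report any unterminated LUs.
    for fragment in find_truncated_lus(sys.stdin.read()):
        print(f"unterminated LU: {fragment!r}")
```

Saved under a name of your choice, it can be fed the output of the apertium -d . br-fr-tagger command from the top of this issue. On that output it should flag the stray "^ " left where P'edon was dropped.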

mr-martian, Dec 22 '20 14:12

#113 fixes br-fr for me.

mr-martian, Dec 22 '20 15:12