
UnicodeDecodeError at tag method

Open umoqnier opened this issue 5 years ago • 1 comments

I am basing my code on this tutorial, and I run into problems with the tag method after the training section. I catch the UnicodeDecodeError exception like this:

try:
    for xseq in X_test:
        Y_pred.append(tagger.tag(xseq))
except UnicodeDecodeError as e:
    print(e)
    print(e.object)

The output looks like this

'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
b'B-qu\xc3\xa9'
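The same message can be reproduced in plain Python, without python-crfsuite, by decoding the reported bytes as ASCII. This is only a sketch to show what the error means: the bytes are a UTF-8 encoded label, and something on the tagging path appears to decode them with the 'ascii' codec. The variable names are illustrative.

```python
# Bytes reported by e.object in the output above.
raw_label = b'B-qu\xc3\xa9'

message = None
try:
    # Decoding as ASCII fails on the first byte >= 0x80 (0xc3 at index 4).
    raw_label.decode('ascii')
except UnicodeDecodeError as e:
    message = str(e)

# The same bytes are valid UTF-8 and decode to the label 'B-qué'.
decoded = raw_label.decode('utf-8')

print(message)
print(decoded)
```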

I tried decoding my X_test with decode('utf-8') before calling tag, but that does not seem to work.

Just in case it is relevant: I had some UnicodeEncodeError problems with the trainer object, shown below, but calling encode('utf-8') on every substring seems to fix them. With this approach I force manual encoding before appending objects to the trainer. This issue is mentioned in #96, and that solution works for me.

for xseq, yseq in zip(X_train, Y_train):
    trainer.append(xseq, yseq)
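The manual encoding step is not shown in the loop above. A minimal sketch of what it could look like follows, assuming each xseq is a list of per-token feature lists of strings (as in the tutorial style) and each yseq is a list of label strings. The helper names encode_item and encode_pair are hypothetical, and this is not the exact code from #96.

```python
def encode_item(item, encoding='utf-8'):
    """Encode str to bytes; leave values that are already bytes untouched."""
    return item.encode(encoding) if isinstance(item, str) else item


def encode_pair(xseq, yseq, encoding='utf-8'):
    """Encode every feature string and every label of one training sequence.

    Assumes xseq is a list of feature lists (one list per token) and
    yseq is a list of labels; both shapes are assumptions.
    """
    x_enc = [[encode_item(feat, encoding) for feat in token] for token in xseq]
    y_enc = [encode_item(label, encoding) for label in yseq]
    return x_enc, y_enc


# Illustrative data only; in the issue these come from X_train / Y_train.
x_enc, y_enc = encode_pair([['word.lower=qué', 'bias']], ['B-qué'])
print(x_enc, y_enc)

# In the training loop this would be used as:
# for xseq, yseq in zip(X_train, Y_train):
#     trainer.append(*encode_pair(xseq, yseq))
```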

NOTE: Sorry for my deficient English. I hope I've been clear enough. If not, please tell me!!! :)

umoqnier avatar May 15 '19 05:05 umoqnier

Hello,

I have exactly the same issue. I am able to train my model with bytes, but when I use the tagger and the output is bytes, there is an internal error (same as above) which prevents me from getting the tag.

The only solution I have for the moment is to use crfsuite instead, which is able to output non-ASCII tags...
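Another possible workaround, not confirmed in this thread, is to keep the labels ASCII-only inside the model and translate them back after tagging. The sketch below assumes that the failing decode only affects labels (as the reported bytes b'B-qu\xc3\xa9' suggest); build_label_maps is a hypothetical helper, and the approach has not been tested against python-crfsuite here.

```python
def build_label_maps(labels):
    """Assign each distinct label an ASCII-only alias ('L0', 'L1', ...).

    Returns (to_ascii, from_ascii) dictionaries for mapping in both directions.
    """
    to_ascii = {}
    for label in labels:
        if label not in to_ascii:
            to_ascii[label] = 'L%d' % len(to_ascii)
    from_ascii = {alias: label for label, alias in to_ascii.items()}
    return to_ascii, from_ascii


# Illustrative labels only; in the issue these come from Y_train.
Y_train = [['B-qué', 'O'], ['O', 'B-qué']]
all_labels = [label for yseq in Y_train for label in yseq]
to_ascii, from_ascii = build_label_maps(all_labels)

# Train on the aliases instead of the original labels.
Y_train_ascii = [[to_ascii[label] for label in yseq] for yseq in Y_train]

# After tagging, map the aliases back, for example:
# Y_pred.append([from_ascii[tag] for tag in tagger.tag(xseq)])
restored = [[from_ascii[alias] for alias in yseq] for yseq in Y_train_ascii]
```

The aliases carry no meaning of their own, so the two dictionaries would need to be saved alongside the model to interpret its output later.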

Iito avatar May 19 '19 23:05 Iito