UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Open liwzhi opened this issue 7 years ago • 19 comments

Hi,

I am trying to load Chinese pretrained word2vec vectors:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)  # C binary format

It throws this error.

liwzhi avatar Jan 22 '18 18:01 liwzhi

Of course, the vectors have to be read with the same codec they were trained with. It seems the model was trained in a different encoding environment. Can you check that?

wiwengweng avatar Jan 26 '18 05:01 wiwengweng

I have come across the same error. Can anybody help? Thank you!

lxw0109 avatar Jan 30 '18 08:01 lxw0109

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

galuhsahid avatar Jan 30 '18 14:01 galuhsahid
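The errors reported in this thread are consistent with a mismatch between the loader and the on-disk format. Byte 0x80 is the first byte of a pickle stream (protocol 2 and later), and '\x1f' is the first byte of a gzip file. Below is a minimal sketch, using only the standard library, that inspects the first bytes of a file before choosing a loader. The function name sniff_vector_file and the labels it returns are hypothetical and not part of gensim.

```python
def sniff_vector_file(path):
    """Guess the container format of a word-vector file from its first bytes.

    Returns one of the hypothetical labels 'gzip', 'pickle', 'word2vec', 'unknown'.
    """
    with open(path, "rb") as f:
        head = f.read(64)

    # gzip files start with the two magic bytes 0x1f 0x8b.
    if head[:2] == b"\x1f\x8b":
        return "gzip"

    # Pickle protocol 2 and later starts with the PROTO opcode, byte 0x80.
    if head[:1] == b"\x80":
        return "pickle"

    # The word2vec C format starts with an ASCII header line: "<vocab_size> <vector_size>".
    first_line = head.split(b"\n", 1)[0].split()
    if len(first_line) == 2 and all(tok.isdigit() for tok in first_line):
        return "word2vec"

    return "unknown"
```

Under these assumptions, a 'pickle' result points to KeyedVectors.load or Word2Vec.load, a 'word2vec' result points to load_word2vec_format, and a 'gzip' result means the file probably needs to be decompressed before either loader can read it.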

@galuhsahid Thank you so much, it works now. : )

lxw0109 avatar Jan 31 '18 02:01 lxw0109

I tried to load the files as you suggested, but I got the following error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

anavaldi avatar Mar 09 '18 10:03 anavaldi

Same error as @anavaldi . Any solution?

Priya22 avatar Mar 18 '18 23:03 Priya22

I solve this error by executing on my own word embeddings with the .sh file.

anavaldi avatar Mar 19 '18 11:03 anavaldi

I came across the same error. I changed gensim.models.KeyedVectors.load_word2vec_format() to gensim.models.Word2Vec.load(), and then it worked.

hinanmu avatar Apr 25 '18 11:04 hinanmu

@hinanmu It works, thanks.

changhyub avatar Apr 26 '18 07:04 changhyub

@anavaldi

I solve this error by executing on my own word embeddings with the .sh file.

What do you mean?

gilgtc avatar May 22 '18 23:05 gilgtc

I tried to load the files as you suggested, but I got the following error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

I solved this issue by downgrading my gensim version from 3.6 to 3.0.

caitaozhan avatar Jan 17 '19 08:01 caitaozhan

UnpicklingError                           Traceback (most recent call last)
 in ()
      3 #model=gensim.models.Word2Vec.load_word2vec_format('model_file', binary=True) Word2Vec.load_word2vec_format
      4 #model_bin = KeyedVectors.load_word2vec_format(model_file,binary=True)
----> 5 model=gensim.models.Word2Vec.load(model_file)
      6 #model=gensim.Word2Vec.load_word2vec_format('model_file',binary=True) word_vectors = KeyedVectors.load(path)

Why is it giving this error?

kusumlata123 avatar Jun 18 '19 05:06 kusumlata123

@kusumlata123 I am getting that UnpicklingError too.

Koteswara-ML avatar Aug 06 '19 10:08 Koteswara-ML

I am also getting the unpickling error... Any ideas? My code is:

chinese_model = gensim.models.Word2Vec.load(os.path.join(desktop, 'cc.zh.300.bin.gz')) 

bright1993ff66 avatar Sep 16 '19 06:09 bright1993ff66

I also tried to save the text file and load it with the function provided on the official fastText site. I first changed the file extension from gz to txt and then used the following function:

import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

However, I got the following errors:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-d67f52bde947> in <module>
----> 1 model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

<ipython-input-3-0f69b5ce62b8> in load_vectors(fname)
      1 def load_vectors(fname):
      2     fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
----> 3     n, d = map(int, fin.readline().split())
      4     data = {}
      5     for line in fin:

ValueError: invalid literal for int() with base 10: '\x08\x08p[\x00\x03cc.zh.300.vec\x00\\ͮfMr7?W3ۀ0|Szдl\x14I\x132'

bright1993ff66 avatar Sep 16 '19 06:09 bright1993ff66
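The string in the ValueError above still contains the original file name cc.zh.300.vec, which is what a gzip header looks like. This suggests that renaming the file from gz to txt did not decompress it. Below is a minimal sketch, assuming the file is a gzip-compressed text file in the .vec layout (a header line followed by one word per line), that reads it through gzip.open instead. The function name load_vectors_gz is hypothetical.

```python
import gzip

def load_vectors_gz(fname):
    """Read a gzip-compressed .vec text file into a {word: [floats]} dict."""
    data = {}
    # Mode "rt" decompresses and decodes to text in one step.
    with gzip.open(fname, "rt", encoding="utf-8", newline="\n") as fin:
        # Header line: "<number of words> <vector dimension>".
        n, d = map(int, fin.readline().split())
        for line in fin:
            tokens = line.rstrip().split(" ")
            # list(...) materialises the values; a bare map() is a lazy iterator in Python 3.
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data
```

This keeps the whole vocabulary in memory as Python lists, so for a full-size model it may be preferable to decompress the file once on disk and load it with a library loader instead.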

I tried the above solution, but I am getting this error: UnpicklingError: invalid load key, '\x1f'. My code:

from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)

thejastr avatar Apr 19 '20 13:04 thejastr

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

ashutoshsoni891 avatar Aug 04 '21 07:08 ashutoshsoni891

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

For the Korean language, I got this error:

AttributeError: Can't get attribute 'Vocab' on <module 'gensim.models.word2vec' from 'C:\Users\ductr\Python\lib\site-packages\gensim\models\word2vec.py'>

Would you mind letting me know what this error means?

trungluu91 avatar Dec 23 '21 05:12 trungluu91
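An error of the form "Can't get attribute 'Vocab' on <module ...>" is what pickle raises when the file refers to a class by module path and that class is not present in the installed module. One plausible cause, stated here as an assumption, is that the model was saved by a gensim release that defined Vocab in gensim.models.word2vec and is being loaded by a release where it is absent. The sketch below reproduces the mechanism with the standard library only. The module name legacy_models is a hypothetical stand-in, not a gensim module.

```python
import pickle
import sys
import types

def unpickle_after_class_removed():
    """Pickle an object, remove its class from the module, then try to unpickle.

    Returns the error message raised by pickle, or None if loading succeeded.
    """
    # Hypothetical stand-in for a library module that once defined Vocab.
    mod = types.ModuleType("legacy_models")
    sys.modules["legacy_models"] = mod
    try:
        Vocab = type("Vocab", (), {"__module__": "legacy_models"})
        mod.Vocab = Vocab

        # The pickle stores the path legacy_models.Vocab, not the class body.
        payload = pickle.dumps(Vocab())

        # Simulate a newer release of the library that no longer has the class.
        del mod.Vocab
        try:
            pickle.loads(payload)
        except AttributeError as exc:
            return str(exc)
        return None
    finally:
        del sys.modules["legacy_models"]
```

If this is the cause, the usual options are to load the file with the library version that wrote it, or to re-export the vectors from that version in a plain format that does not depend on pickled classes.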

I tried the above solution, but I am getting this error: UnpicklingError: invalid load key, '\x1f'. My code:

from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)

I get the same error after using:

from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
model = Word2Vec.load(model_path)

What am I doing wrong?

Louislazarus avatar Aug 07 '23 14:08 Louislazarus