wordvectors UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

Hi,

I am trying to load pretrained Chinese word2vec vectors:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)  # C binary format

It throws the error in the title.

Jan 22 '18 18:01 liwzhi

Of course, the vectors should be trained using the proper codec. It seems the model was trained in a different encoding environment. Can you check that?

Jan 26 '18 05:01 wiwengweng

I have come across the same error. Can anybody help? Thank you ~

Jan 30 '18 08:01 lxw0109

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

Jan 30 '18 14:01 galuhsahid
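
The distinction in the post above can often be checked before calling either loader by looking at the first bytes of the file. What follows is a minimal sketch using only the standard library; the function name `guess_vector_format` and its return labels are hypothetical, not part of gensim, and the check is a heuristic rather than a guarantee. It relies on three facts: a gzip stream starts with the bytes `0x1f 0x8b`, a pickle written with protocol 2 or later starts with byte `0x80` (the byte named in the UnicodeDecodeError), and the word2vec text and C binary formats start with an ASCII header line holding the vocabulary size and the vector dimension.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream
PICKLE_PROTO = b"\x80"    # PROTO opcode that starts a pickle (protocol 2 or later)


def guess_vector_format(path):
    """Guess how a vector file was saved from its first bytes (heuristic only)."""
    with open(path, "rb") as fh:
        head = fh.read(64)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(PICKLE_PROTO):
        return "pickle"
    # word2vec text and C binary files begin with "<vocab_size> <dim>\n" in ASCII.
    header_tokens = head.split(b"\n", 1)[0].split()
    if len(header_tokens) == 2 and all(tok.isdigit() for tok in header_tokens):
        return "word2vec"
    return "unknown"
```

Under these assumptions, a result of "pickle" suggests a file saved by gensim itself (so `KeyedVectors.load` or `Word2Vec.load`), "word2vec" suggests `load_word2vec_format`, and "gzip" suggests the file is still compressed, which would also explain an `invalid load key, '\x1f'` message from the pickle loader.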

@galuhsahid Thank you so much, it works now. : )

Jan 31 '18 02:01 lxw0109

I tried to read the files as you suggested, but I got the following error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

Mar 09 '18 10:03 anavaldi

Same error as @anavaldi. Any solution?

Mar 18 '18 23:03 Priya22

I solve this error by executing on my own word embeddings with the .sh file.

Mar 19 '18 11:03 anavaldi

I have come across the same error. I changed gensim.models.KeyedVectors.load_word2vec_format() into gensim.models.Word2Vec.load(). Then it works.

Apr 25 '18 11:04 hinanmu

@hinanmu It works, thanks.

Apr 26 '18 07:04 changhyub

@anavaldi

I solve this error by executing on my own word embeddings with the .sh file.

What do you mean?

May 22 '18 23:05 gilgtc

I tried to read the files as you suggested, but I got the following error:

 File "C:\ProgramData\Anaconda2\lib\site-packages\gensim\models\base_any2vec.py", line 380, in syn1neg
    self.trainables.syn1neg = value

AttributeError: 'Word2Vec' object has no attribute 'trainables'

:(

I solved this issue by downgrading my gensim version from 3.6 to 3.0.

Jan 17 '19 08:01 caitaozhan

UnpicklingError Traceback (most recent call last) in ()
      3 #model=gensim.models.Word2Vec.load_word2vec_format('model_file', binary=True) Word2Vec.load_word2vec_format
      4 #model_bin = KeyedVectors.load_word2vec_format(model_file,binary=True)
----> 5 model=gensim.models.Word2Vec.load(model_file)
      6 #model=gensim.Word2Vec.load_word2vec_format('model_file',binary=True)

word_vectors = KeyedVectors.load(path)

Why is it giving this error?

Jun 18 '19 05:06 kusumlata123

@kusumlata123 I am also getting that UnpicklingError.

Aug 06 '19 10:08 Koteswara-ML

I am also getting the unpickling error... Any ideas? My code is:

chinese_model = gensim.models.Word2Vec.load(os.path.join(desktop, 'cc.zh.300.bin.gz'))

Sep 16 '19 06:09 bright1993ff66

I also tried to save the text file and load it via the function provided on the official fastText site. I first changed the file extension from gz to txt and then used the following function:

import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

However, I got the following errors:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-d67f52bde947> in <module>
----> 1 model = load_vectors(os.path.join(desktop, 'cc.zh.300.vec.txt'))

<ipython-input-3-0f69b5ce62b8> in load_vectors(fname)
      1 def load_vectors(fname):
      2     fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
----> 3     n, d = map(int, fin.readline().split())
      4     data = {}
      5     for line in fin:

ValueError: invalid literal for int() with base 10: '\x08\x08p[\x00\x03cc.zh.300.vec\x00\\ͮfMr7?W3ۀ0|Szдl\x14I\x132'

Sep 16 '19 06:09 bright1993ff66
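
The bytes in the ValueError above look like a gzip header (note the embedded original file name `cc.zh.300.vec`), which suggests the file is still compressed: changing the extension from gz to txt renames the file but does not decompress it. Below is a hedged variant of the same `load_vectors` function that checks for the gzip magic number and opens the file with the standard library's `gzip.open` when needed. It is a sketch under that assumption, not the official fastText loader; it also wraps the values in `list(...)` because `map` is lazy in Python 3.

```python
import gzip
import io


def load_vectors(fname):
    """Read a fastText-style text .vec file, decompressing gzip if detected."""
    with open(fname, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    opener = gzip.open if is_gzip else io.open
    data = {}
    with opener(fname, "rt", encoding="utf-8", newline="\n", errors="ignore") as fin:
        # Header line: vocabulary size and vector dimension.
        n, d = map(int, fin.readline().split())
        for line in fin:
            tokens = line.rstrip().split(" ")
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data
```

This only applies to the text `.vec` files. A `.bin` file in fastText's native binary format has no such header line and would need a different loader.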

I tried the above solution, but I am getting this error: UnpicklingError: invalid load key, '\x1f'. My code:

from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)

Apr 19 '20 13:04 thejastr

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

Aug 04 '21 07:08 ashutoshsoni891

I came across the same error as well. I changed:

word_vectors = KeyedVectors.load_word2vec_format(path, binary=True)

into

word_vectors = KeyedVectors.load(path)

It turns out that load_word2vec_format is used when we're trying to load word vectors that are trained using the original implementation of word2vec (in C). Since these pre-trained word vectors are trained using Python (gensim), we can use load instead.

When I tried this, I got: UnpicklingError: unpickling stack underflow

For the Korean language, I got this error:

AttributeError: Can't get attribute 'Vocab' on <module 'gensim.models.word2vec' from 'C:\Users\ductr\Python\lib\site-packages\gensim\models\word2vec.py'>

Would you mind letting me know what the error is?

Dec 23 '21 05:12 trungluu91

I tried the above solution, but I am getting this error: UnpicklingError: invalid load key, '\x1f'. My code:

from gensim import models

word2vec_path = 'GoogleNews-vectors-negative300.bin.gz.2'
word2vec = models.KeyedVectors.load(word2vec_path)

I get the same error after using:

from gensim.models import Word2Vec
from gensim.models.keyedvectors import KeyedVectors
model = Word2Vec.load(model_path)

What am I doing wrong?

Aug 07 '23 14:08 Louislazarus
