Error when loading pretrained word vectors

Open jwc19890114 opened this issue 6 years ago • 1 comment

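In the language model example, I saw that the pretrained model word2vec.6B.100d needs to be loaded. I am using glove.6B.50d instead, but loading it raises an error. Any help would be appreciated.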

Traceback (most recent call last):
  File "D:/DesktopBackup/right/MLHomework/AllenNLP/[NLP]Pytorch17_torchTextDemo.py", line 75, in <module>
    wvmodel = gensim.models.KeyedVectors.load_word2vec_format(r'D:\DesktopBackup\right\MLHomework\AllenNLP\data\glove.6B.50d.txt', binary=False, encoding='utf-8')
  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\keyedvectors.py", line 1476, in load_word2vec_format
    limit=limit, datatype=datatype)
  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\utils_any2vec.py", line 344, in _load_word2vec_format
    vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
  File "C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\utils_any2vec.py", line 344, in <genexpr>
    vocab_size, vector_size = (int(x) for x in header.split())  # throws for invalid file format
ValueError: invalid literal for int() with base 10: 'the'

jwc19890114, Apr 17 '19 07:04

word2vec and GloVe use different file formats. You need to convert the GloVe file to the word2vec format; gensim provides a utility for this.

atnlp, Apr 30 '19 05:04
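
For anyone hitting the same ValueError: the traceback shows gensim trying to parse the first line of the file as a "<vocab_size> <vector_size>" header, and a GloVe text file has no such line (it starts directly with the first word, here 'the'). The sketch below illustrates the conversion in plain Python by prepending that header. The function names `glove_to_word2vec` and `load_converted` are made up for this example and are not part of gensim, and the code assumes a space-separated GloVe text file with one vector per line.

```python
from pathlib import Path


def glove_to_word2vec(glove_path, word2vec_path, encoding="utf-8"):
    """Copy a GloVe text file, prepending a "<vocab_size> <vector_size>" header.

    Hypothetical helper for illustration; assumes "word v1 v2 ... vN" lines.
    """
    glove_path, word2vec_path = Path(glove_path), Path(word2vec_path)

    # First pass: count the vectors and read the dimension from the first line.
    vocab_size = 0
    vector_size = None
    with glove_path.open(encoding=encoding) as src:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            if vector_size is None:
                # Every token after the word itself is one vector component.
                vector_size = len(line.rstrip().split(" ")) - 1
            vocab_size += 1
    if vector_size is None:
        raise ValueError(f"{glove_path} contains no vectors")

    # Second pass: write the header, then the original lines unchanged.
    with glove_path.open(encoding=encoding) as src, \
            word2vec_path.open("w", encoding=encoding) as dst:
        dst.write(f"{vocab_size} {vector_size}\n")
        for line in src:
            if line.strip():
                dst.write(line if line.endswith("\n") else line + "\n")
    return vocab_size, vector_size


def load_converted(word2vec_path):
    """Load the converted file with gensim (imported lazily; requires gensim)."""
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(str(word2vec_path), binary=False)
```

In practice the built-in route is probably simpler: `glove2word2vec(glove_input_file, word2vec_output_file)` from `gensim.scripts.glove2word2vec` performs this conversion, and on gensim 4.0 and later `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)` should read the GloVe file directly without a conversion step.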