
FastTextKeyedVectors.load_word2vec_format() does not initialize correctly.

Open rchurch4 opened this issue 3 years ago • 5 comments

Problem description

What are you trying to achieve? What is the expected result? What are you seeing instead?

I have been trying to load my own FastText vectors (trained in Gensim) into a FastTextKeyedVectors model. The expected result is that the saved vectors load cleanly. Instead, the error below occurs when the instance is initialized.

I investigated, and this happens because FastTextKeyedVectors (FTKV) requires the max_n and bucket constructor parameters, whereas plain KeyedVectors does not. The shared loading code constructs the class with only the KeyedVectors arguments, so the FTKV constructor call fails.
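A minimal, self-contained sketch of the mismatch. The signatures below are paraphrased from what the traceback implies about gensim 4.0.x, not copied from the library:

class KeyedVectors:
    def __init__(self, vector_size, count=0, dtype=float):
        self.vector_size = vector_size

class FastTextKeyedVectors(KeyedVectors):
    # FTKV additionally needs the subword parameters min_n, max_n, bucket.
    def __init__(self, vector_size, min_n, max_n, bucket, count=0, dtype=float):
        super().__init__(vector_size, count, dtype)
        self.min_n, self.max_n, self.bucket = min_n, max_n, bucket

# _load_word2vec_format() makes the same constructor call for whichever cls
# it was invoked on:
vector_size, vocab_size, datatype = 100, 50000, float
kv = KeyedVectors(vector_size, vocab_size, dtype=datatype)            # fine
ftkv = FastTextKeyedVectors(vector_size, vocab_size, dtype=datatype)  # TypeError:
# __init__() missing 2 required positional arguments: 'max_n' and 'bucket'
# (note that vocab_size also lands silently in the min_n slot)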

Steps/code/corpus to reproduce

Include full tracebacks, logs and datasets if necessary. Please keep the examples minimal ("minimal reproducible example").

MWE:

from gensim.models.fasttext import FastTextKeyedVectors
from gensim.models import KeyedVectors


kv = KeyedVectors.load_word2vec_format('path_to_ft.txt', binary=False)            # works
ftkv = FastTextKeyedVectors.load_word2vec_format('path_to_ft.txt', binary=False)  # fails:

Traceback (most recent call last):
  File "fasttext_test.py", line 6, in <module>
    ftkv = FastTextKeyedVectors.load_word2vec_format('data/local_election2020_temporal_medium_ft.txt', binary=False)
  File "/home/rob/.env/topics/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 1632, in load_word2vec_format
    limit=limit, datatype=datatype, no_header=no_header,
  File "/home/rob/.env/topics/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 1906, in _load_word2vec_format
    kv = cls(vector_size, vocab_size, dtype=datatype)
TypeError: __init__() missing 2 required positional arguments: 'max_n' and 'bucket'

For full reproducibility, I used embeddings trained on the Twenty Newsgroups data set, as well as a few larger social media data sets. I found this bug while investigating another bug in the saving of FastText (and possibly W2V) vectors in the w2v text format: on some lines, two words are written at the beginning of the line. I will post separately about that bug once I have convinced myself it is not user error.

I'm not sure the lifecycle_events output applies here, because the failure happens before initialization completes.

If your problem is with a specific Gensim model (word2vec, lsimodel, doc2vec, fasttext, ldamodel etc), include the following:

print(my_model.lifecycle_events)

Versions

Please provide the output of:

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import struct; print("Bits", 8 * struct.calcsize("P"))
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
>>> import platform; print(platform.platform())
Linux-5.4.0-1055-gcp-x86_64-with-Ubuntu-18.04-bionic
>>> import sys; print("Python", sys.version)
Python 3.6.9 (default, Jan 26 2021, 15:33:00) 
[GCC 8.4.0]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.19.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.5.4
>>> import gensim; print("gensim", gensim.__version__)
gensim 4.0.1
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

rchurch4 · Oct 26 '21

Alas, a set of word-vectors that's only in plain word2vec-format doesn't have the other info (like character ngrams) needed for something to truly be a FastTextKeyedVectors. So my initial thought here is to just ensure load_word2vec_format() on FastTextKeyedVectors throws a warning/error, directing the user to use KeyedVectors instead.

Or is there a good reason to load less-than-full-FT vectors into a FT-supporting class that I'm not considering?

gojomo · Oct 27 '21

> Alas, a set of word-vectors that's only in plain word2vec-format doesn't have the other info (like character ngrams) needed for something to truly be a FastTextKeyedVectors. So my initial thought here is to just ensure load_word2vec_format() on FastTextKeyedVectors throws a warning/error, directing the user to use KeyedVectors instead.
>
> Or is there a good reason to load less-than-full-FT vectors into a FT-supporting class that I'm not considering?

There's not any good reason that I know of, other than the train of thought I followed: "I trained these vectors using FT, so I should probably load them using FTKV." I agree that a warning or error would be the way to go here. If you wanted to make it seamless, you could issue a warning and initialize with default values for the two parameters in question, but it'd essentially just be a wrapper for KeyedVectors.
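For illustration, that seamless option could look like this standalone helper (the function name and fallback behavior are hypothetical, not anything gensim actually does):

import warnings
from gensim.models import KeyedVectors

def load_ft_text_as_plain_kv(path, binary=False):
    # w2v-format files carry no char-n-gram buckets, so fall back to plain KV.
    warnings.warn(
        "No subword info in word2vec-format files; loading as plain "
        "KeyedVectors rather than FastTextKeyedVectors."
    )
    return KeyedVectors.load_word2vec_format(path, binary=binary)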

rchurch4 · Oct 27 '21

I get the same error with v4.1.2 using Python 3.7 on Windows. The only difference is that the exception occurs in line 1969 in my version.

The line kv = cls(vector_size, vocab_size, dtype=datatype) creates an instance of cls, which here is FastTextKeyedVectors, but the argument list matches KeyedVectors and not FastTextKeyedVectors. Loading the same file into KeyedVectors works fine.

Answering the obvious question why I need FastTextKeyedVectors: my issue is that I have a lot of OOV words in my texts; as far as I know FastText should work with them.

GBR-613 · Nov 01 '21

> Answering the obvious question why I need FastTextKeyedVectors: my issue is that I have a lot of OOV words in my texts; as far as I know FastText should work with them.

But if you're loading a plain full-word vector set – the only thing the formats supported by load_word2vec_format() can carry – there will be no subword information to help with OOV words, and thus no FastText benefit.
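If OOV lookups are the goal, the subword buckets have to travel with the vectors, which means using gensim's native save/load (or a Facebook-format .bin file) rather than the w2v text format. A self-contained sketch with a toy corpus and hypothetical file paths:

from gensim.models import FastText
from gensim.models.fasttext import load_facebook_vectors  # for Facebook .bin files

# Train a tiny model so the example stands alone:
sentences = [["hello", "world"], ["goodbye", "world"]]
model = FastText(sentences, vector_size=10, min_count=1, min_n=2, max_n=4)

# The native gensim round trip keeps the char-n-gram buckets, so OOV works:
model.save('toy_ft.model')            # hypothetical path
model = FastText.load('toy_ft.model')
print(model.wv['worlds'])             # OOV word, synthesized from shared n-grams

# Facebook-format .bin files also carry the subword info:
# ftkv = load_facebook_vectors('cc.en.300.bin')  # hypothetical path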

In a hypothetical world where the FTKV class were fixed so that load_word2vec_format() works, it'd be an open question whether the character n-grams should be left uninitialized – returning zero-vectors for all OOV words – or traditionally randomly-initialized but never trained – returning random nonsense for all OOV words. In neither case would such OOV vectors be what people are hoping for.

(I suppose a 3rd option might be to devise some nonstandard process for bootstrapping char-n-gram vectors from known full-word vectors, for which I think there would be a few obvious potential approaches. But that'd be novel patching-up of non-FastText source vectors – not truly standard FastText.)

That's why my preferred fix here would remain an error with a message referring the user to plain KeyedVectors for reading plain whole-word vector files.
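For reference, that fix might look roughly like the following override (a sketch, not gensim's actual implementation):

from gensim.models import KeyedVectors

class FastTextKeyedVectors(KeyedVectors):
    @classmethod
    def load_word2vec_format(cls, *args, **kwargs):
        # Fail fast and point the user at the class that can read this format.
        raise NotImplementedError(
            "word2vec-format files lack the subword (char-n-gram) info that "
            "FastTextKeyedVectors needs; use KeyedVectors.load_word2vec_format() "
            "for plain whole-word vector files."
        )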

gojomo · Nov 01 '21

I'd also prefer a clear error message, telling the user a) what's wrong (the data on disk doesn't contain the information needed) and b) what to do instead (use KeyedVectors, or find a richer dataset in another format that does contain the subword info).

piskvorky · Nov 01 '21