FastTextKeyedVectors.load_word2vec_format() does not initialize correctly.
Problem description
I have been trying to load my own FastText vectors (trained in Gensim) into a FastTextKeyedVectors model. The expected result is that the saved vectors load without issue. Instead, the error below occurs when the instance is initialized.
I investigated, and this occurs because FastTextKeyedVectors (FTKV) requires the max_n and bucket constructor parameters, whereas a plain KeyedVectors does not. The shared load_word2vec_format() code does not pass these FTKV-specific arguments, so construction fails.
Steps/code/corpus to reproduce
MWE:
from gensim.models.fasttext import FastTextKeyedVectors
from gensim.models import KeyedVectors

# Loading into a plain KeyedVectors works:
kv = KeyedVectors.load_word2vec_format('path_to_ft.txt', binary=False)

# Loading the same file into FastTextKeyedVectors raises the TypeError below:
ftkv = FastTextKeyedVectors.load_word2vec_format('path_to_ft.txt', binary=False)
Traceback (most recent call last):
File "fasttext_test.py", line 6, in <module>
ftkv = FastTextKeyedVectors.load_word2vec_format('data/local_election2020_temporal_medium_ft.txt', binary=False)
File "/home/rob/.env/topics/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 1632, in load_word2vec_format
limit=limit, datatype=datatype, no_header=no_header,
File "/home/rob/.env/topics/lib/python3.6/site-packages/gensim/models/keyedvectors.py", line 1906, in _load_word2vec_format
kv = cls(vector_size, vocab_size, dtype=datatype)
TypeError: __init__() missing 2 required positional arguments: 'max_n' and 'bucket'
For full reproducibility, I used embeddings trained on the Twenty Newsgroups data set, as well as a few larger social media data sets. I found this bug while investigating another bug to do with saving FastText (and possibly W2V) vectors in the w2v text format: on some lines, two words are written at the beginning of the line. I will post separately about that bug once I have convinced myself it is not user error.
I'm not sure whether lifecycle events apply in this case, because the failure happens before the instance is initialized.
Versions
>>> import platform; print(platform.platform())
Linux-5.4.0-1055-gcp-x86_64-with-Ubuntu-18.04-bionic
>>> import sys; print("Python", sys.version)
Python 3.6.9 (default, Jan 26 2021, 15:33:00)
[GCC 8.4.0]
>>> import struct; print("Bits", 8 * struct.calcsize("P"))
Bits 64
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.19.5
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.5.4
>>> import gensim; print("gensim", gensim.__version__)
gensim 4.0.1
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1
Alas, a set of word-vectors that's only in plain word2vec-format doesn't have the other info (like character ngrams) needed for something to truly be a FastTextKeyedVectors. So my initial thought here is to just ensure load_word2vec_format() on FastTextKeyedVectors throws a warning/error, directing the user to use KeyedVectors instead.
Or is there a good reason to load less-than-full-FT vectors into a FT-supporting class that I'm not considering?
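For illustration, a rough sketch of what such a guard could look like; the class name and message wording here are invented for the sketch and are not actual gensim code:

from gensim.models import KeyedVectors

# Rough sketch only -- not the real gensim implementation. The idea is that
# FastTextKeyedVectors would override load_word2vec_format() and fail loudly,
# pointing the user at a class that can actually hold plain word2vec data.
class GuardedFastTextKeyedVectors(KeyedVectors):  # hypothetical stand-in class
    @classmethod
    def load_word2vec_format(cls, *args, **kwargs):
        raise NotImplementedError(
            "The plain word2vec text/binary format carries no subword "
            "(character n-gram) information, so it cannot populate a "
            "FastTextKeyedVectors instance. Load the file with "
            "KeyedVectors.load_word2vec_format() instead, or load a native "
            "FastText .bin file with gensim.models.fasttext.load_facebook_vectors()."
        )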
There's not any good reason that I know of, other than the train of thought that I followed, which was "I trained these vectors using FT, so I should probably load them using FTKV." I agree that a warning or error would be the way to go here. If you wanted to make it seamless, you could issue a warning and initialize with default values for the two parameters in question, but the result would essentially just be a wrapper around KeyedVectors.
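A sketch of that "seamless" variant, purely to illustrate the idea (the helper name and the placeholder values for min_n/max_n/bucket are mine, and OOV lookups on the result would still be meaningless):

import warnings

from gensim.models import KeyedVectors
from gensim.models.fasttext import FastTextKeyedVectors

# Illustrative helper, not a proposed gensim API: warn, build an FTKV with
# placeholder subword parameters, and copy the plain word vectors into it.
def load_w2v_format_as_ftkv(path, binary=False):
    warnings.warn(
        "File contains no character-ngram data; the resulting FastTextKeyedVectors "
        "cannot produce meaningful vectors for OOV words."
    )
    kv = KeyedVectors.load_word2vec_format(path, binary=binary)
    ftkv = FastTextKeyedVectors(kv.vector_size, min_n=0, max_n=0, bucket=0)
    ftkv.add_vectors(kv.index_to_key, kv.vectors)
    return ftkv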
I get the same error with v4.1.2 using Python 3.7 on Windows. The only difference is that the exception occurs in line 1969 in my version.
The line kv = cls(vector_size, vocab_size, dtype=datatype) creates an instance of the class, which here is FastTextKeyedVectors; however, the parameter list matches KeyedVectors and not FastTextKeyedVectors.
Loading the same file into KeyedVectors works fine.
Answering the obvious question of why I need FastTextKeyedVectors: my texts contain a lot of OOV words, and as far as I know FastText should be able to handle them.
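To make the OOV concern concrete (file paths and the misspelled word below are hypothetical, just a sketch):

from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# Plain KeyedVectors has no subword information, so an unseen word simply fails:
kv = KeyedVectors.load_word2vec_format('path_to_ft.txt', binary=False)
# kv['mispeled']                      # raises KeyError -- the word is simply not present

# A full FastText model (a native .bin file with ngram data) can synthesize a vector:
ftkv = load_facebook_vectors('path_to_ft.bin')
vec = ftkv['mispeled']                # composed from character n-grams, no KeyError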
But if you're loading a plain full-words vector set – the only thing that can be loaded via the formats supported by load_word2vec_format() – there will be no subword information to help with OOV vectors, and thus no FastText benefit.
In a hypothetical world where the FTKV class were fixed such that load_word2vec_format() works, it'd be an open question whether the character-n-grams should be left uninitialized – returning zero-vectors for all OOV words – or traditionally randomly-initialized albeit never-trained – returning random nonsense for all OOV words. I don't think such nonsense OOV vectors are what people would be hoping for in either case.
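A simplified, self-contained illustration of why neither choice helps (this is not gensim's actual code or hash function, and the bucket count is scaled down):

import numpy as np

# Toy model of FastText OOV lookup: an OOV vector is the average of the
# hashed character-n-gram bucket vectors. Not gensim's implementation.
def char_ngrams(word, min_n=3, max_n=6):
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]

def oov_vector(word, ngram_buckets):
    num_buckets = ngram_buckets.shape[0]
    hashes = [hash(ng) % num_buckets for ng in char_ngrams(word)]  # toy hash, not FastText's
    return ngram_buckets[hashes].mean(axis=0)

dim, num_buckets = 100, 10_000          # real FastText defaults to 2,000,000 buckets
zero_buckets = np.zeros((num_buckets, dim), dtype=np.float32)
rand_buckets = np.random.uniform(-0.5 / dim, 0.5 / dim, (num_buckets, dim)).astype(np.float32)

print(oov_vector("mispeled", zero_buckets))   # all zeros: every OOV word maps to the same vector
print(oov_vector("mispeled", rand_buckets))   # untrained noise: random nonsense per word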
(I suppose a 3rd option might be to devise some nonstandard process for bootstrapping char-n-gram vectors from known full-word vectors, for which I think there would be a few obvious potential approaches. But that'd be novel patching-up of non-FastText source vectors – not truly standard FastText.)
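For what it's worth, one such nonstandard approach might look like the sketch below (purely illustrative, not anything gensim does): fit n-gram vectors by least squares so that, for each known word, the average of its n-gram vectors approximates the trained full-word vector.

import numpy as np

# Sketch of a hypothetical bootstrapping scheme, not a gensim feature.
def bootstrap_ngram_vectors(words, word_vectors, min_n=3, max_n=6):
    def ngrams(w):
        p = f"<{w}>"
        return [p[i:i + n] for n in range(min_n, max_n + 1) for i in range(len(p) - n + 1)]

    # Inventory of n-grams seen across the known vocabulary.
    all_ngrams = sorted({ng for w in words for ng in ngrams(w)})
    col = {ng: j for j, ng in enumerate(all_ngrams)}

    # A[i, j] = weight of n-gram j in word i, so that A @ G approximates word_vectors.
    A = np.zeros((len(words), len(all_ngrams)), dtype=np.float32)
    for i, w in enumerate(words):
        ngs = ngrams(w)
        for ng in ngs:
            A[i, col[ng]] += 1.0 / len(ngs)

    # Least-squares fit for the n-gram vector matrix G (one row per n-gram).
    G, *_ = np.linalg.lstsq(A, word_vectors, rcond=None)
    return all_ngrams, G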
That's why my preferred fix here would remain an error, with a message referring the user to plain KeyedVectors for reading plain whole-word vector files.
I'd also prefer a clear error message, telling the user a) what's wrong (the data on disk doesn't contain the information needed) and b) what to do instead (use KeyedVectors, or find a richer dataset in another format that does contain the subword info).