gensim-data
Pretrained FastText doesn't handle OOV words
Loading FastText using gensim.downloader returns a KeyedVectors object. Why is that? The model name (fasttext-wiki-news-subwords-300) suggests it should be able to use the algorithm's ability to encode OOV words, but it doesn't.
Also, loading the downloaded model (from the path returned by gensim_data_downloader) using gensim.models.FastText doesn't work.
Thanks for reporting. That does sound like a bug to me. CC @mpenkov can you please have a look?
I agree that it sounds like a bug.
@lambdaofgod Could you please provide a reproducible example?
>>> import gensim.downloader
>>> model = gensim.downloader.load('fasttext-wiki-news-subwords-300')
>>> model
<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7fc5b3964f60>
>>> model.get_vector('dogge')
KeyError: "word 'dogge' not in vocabulary"
Which is not something that you expect from a method that uses subword information.
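For readers unfamiliar with why a subword model is expected to handle 'dogge', here is a minimal, self-contained sketch of the idea: an out-of-vocabulary word is split into character n-grams, each n-gram is hashed into a bucket, and the bucket vectors are averaged. This is not gensim's or fastText's actual code. The function names, the crc32 hash (the real implementation uses a different hash function), and the tiny random bucket table are all assumptions made for illustration; only the n-gram range of 3 to 6 follows fastText's defaults.

```python
import random
import zlib


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def ngram_bucket(ngram, num_buckets):
    """Map an n-gram to a bucket index (crc32 is a stand-in hash, an assumption)."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


def oov_vector(word, bucket_vectors):
    """Average the bucket vectors of the word's n-grams (hypothetical helper)."""
    ngrams = char_ngrams(word)
    if not ngrams:
        # Nothing to compose from, so behave like a plain lookup table would.
        raise KeyError("word '%s' has no n-grams" % word)
    dim = len(bucket_vectors[0])
    total = [0.0] * dim
    for ngram in ngrams:
        row = bucket_vectors[ngram_bucket(ngram, len(bucket_vectors))]
        for k in range(dim):
            total[k] += row[k]
    return [value / len(ngrams) for value in total]


# Toy bucket table: 50 buckets of dimension 4, filled with fixed-seed noise.
rng = random.Random(0)
toy_buckets = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(50)]

vector = oov_vector("dogge", toy_buckets)
print(char_ngrams("dogge")[:3], len(vector))
```

A plain word-to-vector table has no such fallback, which is consistent with the KeyError above: a lookup-only object can only return vectors for words it has stored.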
Thank you for providing the reproducible example. Can you please include the full stack trace?
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-17-c007c0b2c10b> in <module>()
----> 1 fasttext_w2v_format.get_vector('dogge')
1 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
450 return result
451 else:
--> 452 raise KeyError("word '%s' not in vocabulary" % word)
453
454 def get_vector(self, word):
KeyError: "word 'dogge' not in vocabulary"
I think it's pretty self-explanatory from what I posted before that the model uses the incorrect wrapper: it uses gensim.models.keyedvectors.Word2VecKeyedVectors instead of gensim.models.FastText.
Will it be resolved in a future release?
Ping @mpenkov -- this is the same issue as on that mailing list (I knew I already saw it somewhere!). Really confusing behaviour.
@piskvorky @mpenkov could you help me pinpoint the problem? I may be willing to fix it, but for now I don't know where to start, because I don't see what code gets called when the downloader creates the object.
AFAIR, it's this code in __init__.py inside the fasttext-wiki-news-subwords-300 release: https://github.com/RaRe-Technologies/gensim-data/releases/tag/fasttext-wiki-news-subwords-300
@mpenkov can you confirm?
Hello from a random user. I am trying to get vectors from the said model but it produces an error saying that the word is not found in vocab. What should I do? Wait for a fix? Thanks
Any updates on this issue?
I'm curious too. @mpenkov can you please have a look? I know you reworked and clarified our FastText recently, thanks.
Sorry, I'm a little overwhelmed at the moment with work, travel and general end-of-year life-stuff. I'll have a look at this when I can, but I hope no-one out there is holding their breath :)
Has there been any update on this issue since 2019?