Pretrained FastText doesn't handle OOV words

Open lambdaofgod opened this issue 5 years ago • 14 comments

Loading FastText with gensim.downloader returns a KeyedVectors object. Why is that? The model name (fasttext-wiki-news-subwords-300) suggests it should be able to use the algorithm's ability to encode OOV words, but it currently doesn't.

Also, loading the downloaded model (from the path returned by gensim_data_downloader) with gensim.models.FastText doesn't work.

lambdaofgod avatar May 09 '19 17:05 lambdaofgod

Thanks for reporting. That does sound like a bug to me. CC @mpenkov can you please have a look?

piskvorky avatar May 09 '19 18:05 piskvorky

I agree that it sounds like a bug.

@lambdaofgod Could you please provide a reproducible example?

mpenkov avatar May 10 '19 00:05 mpenkov

>>> import gensim.downloader
>>> model = gensim.downloader.load('fasttext-wiki-news-subwords-300')
>>> model
<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7fc5b3964f60>
>>> model.get_vector('dogge')
KeyError: "word 'dogge' not in vocabulary"

That is not what you would expect from a method that uses subword information.
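
For readers unfamiliar with the subword mechanism, here is a minimal sketch of how a FastText-style model can build a vector for an unseen word from its character n-grams, while a plain word-to-vector table cannot. Everything in it is an assumption made for illustration: the toy dimensions, the random `ngram_table`, the `zlib.crc32` hash (fastText uses its own hash function), and the helper names. It is not the gensim or fastText implementation.

```python
import random
import zlib

DIM = 4         # toy dimensionality (the model in this issue has 300 dimensions)
BUCKETS = 1000  # toy number of hash buckets, chosen arbitrarily


def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Hypothetical n-gram table: random vectors stand in for trained weights.
rng = random.Random(0)
ngram_table = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(BUCKETS)]


def bucket(gram):
    """Map an n-gram to a row of the table (crc32 is a stand-in hash)."""
    return zlib.crc32(gram.encode("utf-8")) % BUCKETS


def oov_vector(word):
    """Average the vectors of all n-grams of `word`; works for unseen words too."""
    rows = [ngram_table[bucket(g)] for g in char_ngrams(word)]
    return [sum(column) / len(rows) for column in zip(*rows)]


# A plain word-to-vector table, analogous to a KeyedVectors without n-grams.
vocab = {"dog": [0.1, 0.2, 0.3, 0.4]}


def plain_lookup(word):
    """Full-word lookup only: raises KeyError for any word not in `vocab`."""
    return vocab[word]


vec = oov_vector("dogge")  # succeeds, although "dogge" is not in `vocab`
```

Calling `plain_lookup("dogge")` raises KeyError, which mirrors the behaviour reported above, whereas `oov_vector("dogge")` returns a vector because it never needs the full word to be in the vocabulary.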

lambdaofgod avatar May 11 '19 10:05 lambdaofgod

Thank you for providing the reproducible example. Can you please include the full stack trace?

mpenkov avatar May 11 '19 12:05 mpenkov

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-c007c0b2c10b> in <module>()
----> 1 fasttext_w2v_format.get_vector('dogge')

1 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/keyedvectors.py in word_vec(self, word, use_norm)
    450             return result
    451         else:
--> 452             raise KeyError("word '%s' not in vocabulary" % word)
    453 
    454     def get_vector(self, word):

KeyError: "word 'dogge' not in vocabulary"

I think it's pretty self-explanatory from what I posted before that the model uses the wrong wrapper: it is loaded as gensim.models.keyedvectors.Word2VecKeyedVectors instead of gensim.models.FastText.
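
A quick way to tell which kind of object was loaded, without inspecting its class, is to probe it with a token that is certainly not in the vocabulary. The sketch below relies only on the behaviour shown in this thread (get_vector raising KeyError for unknown words). The helper `handles_oov`, the probe string, and the two stand-in classes are hypothetical and exist purely so the example runs without gensim installed.

```python
def handles_oov(vectors, probe="qzxjvkwq_not_a_word"):
    """Return True if `vectors.get_vector(probe)` succeeds for an unseen token.

    Assumes `probe` really is absent from the vocabulary of `vectors`.
    """
    try:
        vectors.get_vector(probe)
    except KeyError:
        return False
    return True


class WordOnlyVectors:
    """Hypothetical stand-in for a full-word-only lookup table."""

    def __init__(self, table):
        self.table = table

    def get_vector(self, word):
        if word not in self.table:
            raise KeyError("word '%s' not in vocabulary" % word)
        return self.table[word]


class SubwordVectors:
    """Hypothetical stand-in for a model that can always compose a vector."""

    def get_vector(self, word):
        return [0.0, 0.0]


word_only = WordOnlyVectors({"dog": [1.0, 2.0]})
subword = SubwordVectors()
```

With the object returned by the downloader in this issue, the same probe would be expected to return False, since get_vector('dogge') already fails.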

lambdaofgod avatar May 11 '19 15:05 lambdaofgod

Will it be resolved in a future release?

GladiatorX avatar Jun 13 '19 11:06 GladiatorX

Ping @mpenkov -- this is the same issue as on that mailing list (I knew I already saw it somewhere!). Really confusing behaviour.

piskvorky avatar Jun 13 '19 12:06 piskvorky

@piskvorky @mpenkov could you help me pinpoint the problem? I may be willing to fix it, but for now I don't know where to start, because I can't see what code gets called when the downloader creates the object.

lambdaofgod avatar Jun 13 '19 13:06 lambdaofgod

AFAIR, it's this code in __init__.py inside the fasttext-wiki-news-subwords-300 release: https://github.com/RaRe-Technologies/gensim-data/releases/tag/fasttext-wiki-news-subwords-300

@mpenkov can you confirm?

piskvorky avatar Jun 13 '19 13:06 piskvorky

Hello from a random user. I am trying to get vectors from the said model but it produces an error saying that the word is not found in vocab. What should I do? Wait for a fix? Thanks

a66as avatar Oct 05 '19 12:10 a66as

Any updates on this issue?

a66as avatar Nov 14 '19 09:11 a66as

I'm curious too. @mpenkov can you please have a look? I know you reworked and clarified our FastText recently, thanks.

piskvorky avatar Nov 14 '19 10:11 piskvorky

Sorry, I'm a little overwhelmed at the moment with work, travel and general end-of-year life-stuff. I'll have a look at this when I can, but I hope no-one out there is holding their breath :)

mpenkov avatar Nov 18 '19 15:11 mpenkov

Has there been any update on this issue since 2019?

cephcyn avatar Mar 14 '21 03:03 cephcyn