
Get some numpy error when running doc2vec

Open leuchine opened this issue 9 years ago • 12 comments

Hi. I get the following error when running doc2vec

here is my code:

from gensim.models.doc2vec import LabeledSentence, Doc2Vec

class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for uid, line in enumerate(open(self.filename)):
            yield LabeledSentence(words=line.split(), tags=['SENT_%s' % uid])

sentences=LabeledLineSentence('id.txt')
model = Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
model.build_vocab(sentences.__iter__())
model.train(sentences)




Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim-0.12.3-py2.7-macosx-10.6-intel.egg/gensim/models/word2vec.py", line 729, in worker_loop
    tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim-0.12.3-py2.7-macosx-10.6-intel.egg/gensim/models/doc2vec.py", line 672, in _do_train_job
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "gensim/models/doc2vec_inner.pyx", line 449, in gensim.models.doc2vec_inner.train_document_dm (./gensim/models/doc2vec_inner.c:5508)
    codes[i] = <np.uint8_t *>np.PyArray_DATA(predict_word.code)
TypeError: Cannot convert list to numpy.ndarray

leuchine avatar Jan 07 '16 08:01 leuchine

Hi, did you install gensim via OSX wheel?

tmylk avatar Jan 09 '16 19:01 tmylk

Hi:

Thanks for your email. Yes, I am using the OSX wheel. I fixed the problem by replacing the model initialization with this one:

model = Doc2Vec(min_count=1, window=10, size=5, sample=1e-4, negative=5, workers=7)

Hope it can help. Thanks for your reply. Gensim is an excellent tool :)

Best Regards, Liu Qi


leuchine avatar Jan 10 '16 05:01 leuchine

@gojomo Should this be closed as resolved?

tmylk avatar Jan 23 '16 21:01 tmylk

Hi,

I have the exact same problem. What's wrong with OSX?

Regards,

Patrick.

patwat avatar Mar 25 '16 11:03 patwat

@patwat - do you get the exact same traceback? What are your versions of gensim & numpy, and what are your initialization parameters for the Doc2Vec model?

gojomo avatar Mar 26 '16 03:03 gojomo

Hello,

Something weird: the error only happens in a notebook (Python 3). When I run the exact same code from a file, everything works fine.

The code:

d2v_parameters = {
    "iterations": 10,
    "size": 300,
    "alpha": 0.025,
    "window": 8,
    "min_count": 5,
    "max_vocab_size": None,
    "sample": 0,
    "seed": 1,
    "workers": 1,
    "min_alpha": 0.0001,
    "dm": 1,
    "hs": 1,
    "negative": 0,
    "dbow_words": 0,
    "dm_mean": 0,
    "dm_concat": 0,
    "dm_tag_count": 1,
    "docvecs": None,
    "docvecs_mapfile": None,
    "comment": None,
    "trim_rule": None
}

corpus = TaggedCorpusIterator(training_files, global_config)

d2v = gensim.models.doc2vec.Doc2Vec(size=d2v_parameters["size"],
                                    alpha=d2v_parameters["alpha"],
                                    window=d2v_parameters["window"],
                                    min_count=d2v_parameters["min_count"],
                                    max_vocab_size=d2v_parameters["max_vocab_size"],
                                    sample=d2v_parameters["sample"],
                                    seed=d2v_parameters["seed"],
                                    workers=d2v_parameters["workers"],
                                    min_alpha=d2v_parameters["min_alpha"],
                                    dm=d2v_parameters["dm"],
                                    hs=d2v_parameters["hs"],
                                    negative=d2v_parameters["negative"],
                                    dbow_words=d2v_parameters["dbow_words"],
                                    dm_mean=d2v_parameters["dm_mean"],
                                    dm_concat=d2v_parameters["dm_concat"],
                                    dm_tag_count=d2v_parameters["dm_tag_count"],
                                    docvecs=d2v_parameters["docvecs"],
                                    docvecs_mapfile=d2v_parameters["docvecs_mapfile"],
                                    comment=d2v_parameters["comment"],
                                    trim_rule=d2v_parameters["trim_rule"])

d2v.build_vocab(corpus)

for epoch in range(d2v_parameters["iterations"]):
    d2v.train(corpus)
    d2v.alpha -= 0.002
    d2v.min_alpha = d2v.alpha

d2v.save(d2v_model)
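As a side note, a long keyword list like the one above can be collapsed with dict unpacking, assuming every remaining key matches a `Doc2Vec` constructor keyword. A minimal sketch (the short dict below is illustrative, not the full parameter set):

```python
# Hypothetical sketch: drop keys that are not constructor kwargs
# (like "iterations"), then unpack the rest into the constructor.
d2v_parameters = {"iterations": 10, "size": 300, "alpha": 0.025, "window": 8}
ctor_kwargs = {k: v for k, v in d2v_parameters.items() if k != "iterations"}
# d2v = gensim.models.doc2vec.Doc2Vec(**ctor_kwargs)
print(ctor_kwargs)
```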

The error:

Exception in thread Thread-7:
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 914, in _bootstrap_inner
    self.run()
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 862, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gensim-0.12.4-py3.5-macosx-10.11-x86_64.egg/gensim/models/word2vec.py", line 735, in worker_loop
    tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gensim-0.12.4-py3.5-macosx-10.11-x86_64.egg/gensim/models/doc2vec.py", line 672, in _do_train_job
    doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
  File "gensim/models/doc2vec_inner.pyx", line 449, in gensim.models.doc2vec_inner.train_document_dm (./gensim/models/doc2vec_inner.c:5508)
TypeError: Cannot convert list to numpy.ndarray

Patrick.

patwat avatar Mar 27 '16 13:03 patwat

My guess would be that the environment where you're getting the error has some mismatch between the gensim and numpy versions; perhaps one of them has changed since the other's native code was compiled. In both the working and non-working environments, I would collect (a) the Python version, (b) the numpy version, and (c) the gensim version, and look for discrepancies. But even if there is no discrepancy, it's possible the update ordering created a mismatch, so I would also try uninstalling numpy/scipy/gensim in the relevant environment, verifying they're gone (and that you're clear on which environment is active), and then reinstalling them.
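One quick way to collect those three versions in each environment is a small script like this (a sketch; it reports packages as "not installed" rather than failing when an import is unavailable):

```python
import importlib
import sys

def report_versions(packages=("numpy", "scipy", "gensim")):
    """Collect interpreter and package versions for comparing environments."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = "not installed"
    return versions

print(report_versions())
```

Run it once in the working environment and once in the failing one, and diff the output.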

gojomo avatar Mar 28 '16 05:03 gojomo

I once got the same error with Word2Vec. It happens when using an option that requires the Huffman tree while the vocabulary contains only one word/element to be assigned a code: in that case, the list containing its binary code is empty, and the bug occurs when Cython tries to create an array from it.

When you changed the word-embedding build options and set min_count to 1, I guess the vocabulary then contained strictly more than one word, so the corresponding lists were no longer empty and the bug disappeared.

I don't understand why that would depend on the environment, though. Maybe in some environments the executables produced when compiling the Cython code do not throw an exception when passed an empty list (much like pure Python, which returns an empty array in this case)?
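The single-leaf case is easy to see with a toy Huffman coder (a pure-Python sketch of the code-assignment idea behind hierarchical softmax, not gensim's actual implementation): with only one symbol in the vocabulary there are no merges, so the sole word ends up with an empty code.

```python
import heapq

def huffman_codes(freqs):
    """Assign binary Huffman codes to words, given their frequencies."""
    # Each heap entry: (frequency, tie-breaker, {word: code-so-far}).
    heap = [(freq, i, {word: ""}) for i, (word, freq) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prepend one bit for every word in each merged subtree.
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_codes({"a": 3}))          # one-word vocabulary: code is the empty string
print(huffman_codes({"a": 3, "b": 1}))  # two words: one-bit codes
```

An empty code list is exactly what the Cython array conversion in the traceback cannot handle.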

fnoyez avatar May 02 '16 10:05 fnoyez

Ping @mpenkov can we still reproduce this?

piskvorky avatar Oct 08 '19 09:10 piskvorky

The problem can be reproduced for the Word2Vec model if your whole training corpus contains only a single unique token (no matter how many times that token occurs), even if min_count is set to 1.

hafiz031 avatar Feb 16 '22 05:02 hafiz031

The problem can be reproduced for the Word2Vec model if your whole training corpus contains only a single unique token (no matter how many times that token occurs), even if min_count is set to 1.

Do you have example code & the related traceback for triggering such an error in the latest Gensim? Even if the error looks similar, after all the refactoring the related classes have gone through in the last 4 years, it may not be the same problem.

And if it only happens in non-default hs=1 mode, when training with a toy corpus (of only one unique token) that couldn't possibly generate a useful Doc2Vec/Word2Vec/etc model, we may want to improve the error message/handling, but generally the code shouldn't be expected to do anything useful in such a corner case.

gojomo avatar Feb 19 '22 01:02 gojomo

@gojomo I just tried it on the latest gensim==4.1.2. And yes, it only happens when you set hs=1 and there is only one unique token in the entire corpus. I have modified the official example mentioned here to reproduce it:

from gensim.models import Word2Vec

sentences = [["a", "a", "a"], ["a", "a", "a"]]  # corpus with a single unique token
model = Word2Vec(sentences, min_count=1, hs=1)  # hs=1 triggers the error

Hope it helps.

hafiz031 avatar Feb 19 '22 04:02 hafiz031