Segmentation fault using build_vocab(..., update=True) for Doc2Vec
Hello!
I'm performing online learning for Doc2Vec, that is, I learn an initial model on a set of tagged documents and then try to update the model on a new set of tagged documents. If the second set contains new tags (tags that were not present in the initial set of documents), then I usually get a segmentation fault (this behavior is not deterministic, but it happens most of the time).
Below you can find a toy example that reproduces the issue; and here is the output of that code. I'm using Python 3.4.3 and Gensim 0.13.3.
I've debugged with gdb and I've got the following output:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff9a4f8700 (LWP 29422)]
__pyx_f_6gensim_6models_13doc2vec_inner_fast_document_dm_hs (__pyx_v_learn_hidden=1, __pyx_v_size=300, __pyx_v_work=0x7fff80001480, __pyx_v_alpha=0.0250000004, __pyx_v_syn1=0x1693ce0, __pyx_v_neu1=0x7fff80001a00, __pyx_v_word_code_len=6,
__pyx_v_word_code=<optimized out>, __pyx_v_word_point=0x13fe410) at ./gensim/models/doc2vec_inner.c:2078
I'm willing to help fix this issue if someone can provide some guidance. Thanks!
Sample code that reproduces the issue:
import logging

from gensim.models.doc2vec import (
    Doc2Vec,
    TaggedDocument,
)

logging.basicConfig(
    format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s',
    level=logging.DEBUG,
)


def to_str(d):
    return ", ".join(d.keys())


SENTS = [
    "anecdotal using a personal experience or an isolated example instead of a sound argument or compelling evidence",
    "plausible thinking that just because something is plausible means that it is true",
    "occam razor is used as a heuristic technique discovery tool to guide scientists in the development of theoretical models rather than as an arbiter between published models",
    "karl popper argues that a preference for simple theories need not appeal to practical or aesthetic considerations",
    "the successful prediction of a stock future price could yield significant profit",
]
SENTS = [s.split() for s in SENTS]


def main():
    sentences_1 = [
        TaggedDocument(SENTS[0], tags=['SENT_0']),
        TaggedDocument(SENTS[1], tags=['SENT_0']),
        TaggedDocument(SENTS[2], tags=['SENT_1']),
    ]
    sentences_2 = [
        TaggedDocument(SENTS[3], tags=['SENT_1']),
        TaggedDocument(SENTS[4], tags=['SENT_2']),
    ]

    model = Doc2Vec(min_count=1, workers=1)

    model.build_vocab(sentences_1)
    model.train(sentences_1)
    print("-- Base model")
    print("Vocabulary:", to_str(model.vocab))
    print("Tags:", to_str(model.docvecs.doctags))

    model.build_vocab(sentences_2, update=True)
    model.train(sentences_2)
    print("-- Updated model")
    print("Vocabulary:", to_str(model.vocab))
    print("Tags:", to_str(model.docvecs.doctags))


if __name__ == '__main__':
    main()
Vocab expansion for doc2vec is not supported yet, so I've labelled this as a new feature.
I ran into this also. I was taking a look at how updating the vocabulary works in online word2vec and tried to replicate the update for doc2vec's doctags.
It seems to work - as in I can train the model with a few examples and then load it, train it more and it will return new doctags and vocabulary in the similarity functions. When storing the updated model I do have to give it a different filename, otherwise the segmentation fault still happens. But the weights look like they get updated too. Here are my edits to the original doc2vec.py.
In the DocvecsArray class:
I added a function to store new doctags from new training in a new property, self.new_doctags = {}:
def note_newdoctag(self, key, document_no, document_length, model):
    if isinstance(key, int):
        self.max_rawint = max(self.max_rawint, key)
    else:
        if key in self.doctags:
            self.doctags[key] = self.doctags[key].repeat(document_length)
        else:
            self.doctags[key] = Doctag(len(self.offset2doctag), document_length, 1)
            self.new_doctags[key] = Doctag(len(self.offset2doctag), document_length, 1)
            self.offset2doctag.append(key)
    self.new_count = self.max_rawint + 1 + len(self.offset2doctag)
Also an update_weights function:
def update_weights(self, model):
    gained_tags = len(self.doctags) - len(self.new_doctags)
    gained_tags = len(self.new_doctags)
    newsyn0 = empty((gained_tags, model.vector_size), dtype=REAL)
    # randomize the remaining tags
    for i in xrange(len(self.new_doctags), len(self.doctags)):
        # construct deterministic seed from word AND seed argument
        newsyn0[i - len(self.doctag_syn0)] = model.seeded_vector(i + model.seed)
    self.doctag_syn0 = vstack([self.doctag_syn0, newsyn0])
    self.doctag_syn0_lockf = ones(len(self.doctags), dtype=REAL)  # zeros suppress learning
In the Doc2Vec class:
Then in the scan_vocab function of the Doc2Vec class, call the note_newdoctag function when build_vocab is called with update=True:
for document_no, document in enumerate(documents):
    ...
    if not update:
        for tag in document.tags:
            self.docvecs.note_doctag(tag, document_no, document_length, self)
    else:
        for tag in document.tags:
            self.docvecs.note_newdoctag(tag, document_no, document_length, self)
    ...
When finalize_vocab is called in the super class it doesn't run my new update_weights in DocvecsArray, so I dropped finalize_vocab into Doc2Vec and added self.docvecs.update_weights(self) at the end of it.
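For readers following along, here is a rough sketch of that override as I understand it (the method name and signature follow the older gensim internals discussed in this thread, so treat them as assumptions); delegating to the superclass and then calling the new update_weights gives the same hook without copying the whole method body:

# inside the modified doc2vec.py, on the Doc2Vec class
def finalize_vocab(self, update=False):
    # let Word2Vec do the usual finalization (Huffman tree / cum_table,
    # reset or update of the *word* weights) ...
    super(Doc2Vec, self).finalize_vocab(update=update)
    # ... then also grow the per-document vectors for the newly noted tags
    if update:
        self.docvecs.update_weights(self)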
Here is a link to the full file: https://gist.github.com/korostelevm/d48c80f296516deef045e5aa5dca1518
I just import doc2vec_online as doc2vec instead of from gensim.models import doc2vec.
Disclaimer: I may not know what I'm doing at all, which is why I'm posting here for someone to hopefully verify.
As @tmylk notes, the existing vocab-expansion feature (build_vocab(..., update=True)) wasn't yet designed/tested for Doc2Vec use – so it might work (because of the significant code overlap), or fail in either subtle or extreme ways (like a SegFault)... it's an unknown.
The times that it's not SegFaulting, there may still be silent corruption – just no memory accesses so bad that they trigger the fault.
Perhaps something in the Doc2Vec paths is still using lengths/references to data that wasn't refreshed by the build_vocab(..., update=True) call?
That's what it seemed like to me. I forced it into the slow mode to debug it - at the top of doc2vec.py:
try:
    from gensim.models.doc2vec_inner import train_document_dbow, train_document_dm, train_document_dm_concat
    from gensim.models.word2vec_inner import FAST_VERSION  # blas-adaptation shared from word2vec
    logger.debug('Fast version of {0} is being used'.format(__name__))
    print asdf  # deliberately undefined, so an error is raised and the except branch (pure-Python fallback) runs
# except ImportError:
except Exception:
Then I replaced the train function from word2vec and changed if FAST_VERSION < 0: to always run the Python threading.
After this instead of getting a segmentation fault I get this in the traceback:
File "/Users/mike/Dropbox/lsp/recommender/doc2vec_original.py", line 771, in worker_loop
tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
File "/Users/mike/Dropbox/lsp/recommender/doc2vec_original.py", line 912, in _do_train_job
doctag_vectors=doctag_vectors, doctag_locks=doctag_locks)
File "/Users/mike/Dropbox/lsp/recommender/doc2vec_original.py", line 115, in train_document_dbow
context_locks=doctag_locks)
File "/usr/local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 269, in train_sg_pair
l1 = context_vectors[context_index] # input word (NN input/projection layer)
IndexError: index 10 is out of bounds for axis 0 with size 3
Which I think was trying to tell me that index 10 of my doctags is more than the 3 I had in there in the first round of training. So I did the stuff I mentioned above and it seemed to fix the issue. I put the fast mode flags back and it still works.
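To make that mismatch concrete, here is a tiny illustration (plain numpy, not gensim code) of why an un-expanded doctag array produces exactly that kind of IndexError:

import numpy as np

# doctag_syn0 was sized for the 3 doctags seen by the first build_vocab() ...
doctag_syn0 = np.zeros((3, 100))
# ... but after build_vocab(..., update=True) a new tag gets an offset beyond that
# size, so looking it up fails just like the traceback above.
doctag_syn0[10]  # IndexError: index 10 is out of bounds for axis 0 with size 3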
I used ddd to debug the Cython code and it seemed that the segmentation fault appears at line 123 of doc2vec_inner.pyx: g = (1 - word_code[b] - f) * alpha. Then it turned out that the mistake comes from these lines:
if hs:
    codelens[i] = <int>len(predict_word.code)
    codes[i] = <np.uint8_t *>np.PyArray_DATA(predict_word.code)
    points[i] = <np.uint32_t *>np.PyArray_DATA(predict_word.point)
With the model's hs parameter set to 0 there are no errors (verified with ddd on both Python 2 and 3). So the proposed hotfix is to turn off hs mode when the model is updated.
An appropriate hotfix would be to disable vocabulary expansion for doc2vec models, but a proper fix would be better.
Yes, and the proper fix will require figuring out why the model, post-vocab-update, is using some older or incorrect arrays or sizes, and thus making an improper/illegal memory access.
Current status: only works for hs=0. Hotfix needed: disable for hs > 0.
Looks like I'm still getting a segfault when hs=0. (Based on doc2vec.py:590, it looks like that is the default, though the docs say it's 1.)
def get_doc2vec():
    return Doc2Vec(size=200,
                   iter=1,
                   min_count=30,
                   workers=multiprocessing.cpu_count(),
                   dm=0)


def build_doc2vec(sentences, model=None, total_examples=None, i=0):
    tagged_documents = [TaggedDocument(d, [i]) for d, i in zip(sentences, range(i, i + len(sentences)))]
    if not model:
        model = get_doc2vec()
        model.build_vocab(tagged_documents)
    else:
        model.build_vocab(tagged_documents, update=True)
    model.train(tagged_documents, total_examples=model.corpus_count, epochs=model.iter)
    return (model, i + len(sentences))
Apologies if my code is unclear, but essentially I'm doing the same thing as others above. Any help would be much appreciated.
On a side note, I'm sure I'm using total_examples wrong, but when I put in the real total_examples count across all training calls, it says something like the expected count doesn't match the count for sentences on my current call.
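For reference, my current understanding (an assumption about the 3.x API, not something confirmed in this thread) is that total_examples should count only the documents passed to that particular train() call, not a running total across calls; model.corpus_count is refreshed by the preceding build_vocab call, so it already matches:

# equivalent ways to size the current batch, assuming the snippet above
model.train(tagged_documents,
            total_examples=len(tagged_documents),  # == model.corpus_count here
            epochs=model.iter)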
Is it useful to call the train() function repeatedly on a Doc2Vec model without adding new vocabulary? Will the model get better for new data?
@rajivgrover009 Maybe. Whether it helps or hurts is probably dependent on your dataset, choice of parameters, and the relative contrast between your new texts and the earlier texts. The best-grounded course would be to mix new texts with old to make a new all-inclusive corpus, and continue training with that.
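For illustration, a minimal sketch of that all-inclusive-corpus approach (toy data and names made up here):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

old_docs = [TaggedDocument("an old training text".split(), tags=['OLD_0'])]
new_docs = [TaggedDocument("a newly arrived text".split(), tags=['NEW_0'])]

# rebuild and retrain from scratch on the union of old and new documents,
# instead of expanding the vocab of an existing model
combined = old_docs + new_docs
model = Doc2Vec(min_count=1, workers=1)
model.build_vocab(combined)
model.train(combined, total_examples=model.corpus_count, epochs=model.epochs)  # model.iter on older versions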
There's another report from @mullenba in #1578, which includes a minimal triggering case.
I'm trying to look into this. Here is a status update...
Previously, @tmylk reported that doc2vec's vocab expansion works as long as hs=0. This isn't correct: it crashes if either negative != 0 (default: 5) or hs != 0 (default: 0). In other words, it is useless for all practical purposes.
To debug and iterate quickly, I used this workflow:
- Change doc2vec_inner.c into doc2vec_inner.pyx at this line of the setup script, so that Cythonize is invoked automatically every time there's a change in the pyx file.
- Build with CFLAGS='-Wall -O0 -g' python setup.py build, then install.
- Run gdb and cause the crash using the minimal triggering case in #1578.
The coredump points at this line; apparently the index is out of the bounds of EXP_TABLE, which causes the segfault.
The equivalent piece of code for word2vec is here. I've read that vocab expansion is supposed to work for word2vec, so I was planning to use that as a guide to check the differences.
Does anyone want to join me in this debugging adventure? 😄
ps: by the way, I tried to deliberately run the "slow" pure-python implementation of doc2vec to see if vocab expansion works. Same problem: it crashes here because doctag_vectors is apparently not expanded correctly and doctag_indexes goes out of bounds.
The pure-python path isn't actually core-dump 'crashing', is it? (I'd think it'd have to be a printed exception, instead.)
Note that segfault crashes are often caused by earlier memory-corruption, rather than the exact line where they're triggered.
> Note that segfault crashes are often caused by earlier memory-corruption, rather than the exact line where they're triggered.
Thanks, but in this case it seems that indeed the index is pointing outside of EXP_TABLE. I still have to trace it back, though.
> The pure-python path isn't actually core-dump 'crashing', is it?
Yes, it's not coredumping. As I said, it goes out of bounds when it reaches the first new doctag (i.e., "animals" at line 29 of this minimal code) as follows:
Traceback (most recent call last):
File "/x/y/threading.py", line 916, in _bootstrap_inner
self.run()
File "/x/y/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/x/y/site-packages/gensim-3.2.0-py3.6-linux-x86_64.egg/gensim/models/word2vec.py", line 992, in worker_loop
tally, raw_tally = self._do_train_job(sentences, alpha, (work, neu1))
File "/x/y/site-packages/gensim-3.2.0-py3.6-linux-x86_64.egg/gensim/models/doc2vec.py", line 752, in _do_train_job
doctag_vectors=doctag_vectors, doctag_locks=doctag_locks
File "/x/y/site-packages/gensim-3.2.0-py3.6-linux-x86_64.egg/gensim/models/doc2vec.py", line 162, in train_document_dm
l1 = np_sum(word_vectors[word2_indexes], axis=0) + np_sum(doctag_vectors[doctag_indexes], axis=0)
IndexError: index 1 is out of bounds for axis 0 with size 1
Please note that I had to add the line model.neg_labels = zeros(6) in order for the "slow" version to work at all.
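For context: if I remember the word2vec source correctly (treat this as an assumption about the pure-Python path), neg_labels is normally set up inside Word2Vec.train() roughly like this, which is why it is missing when that setup is bypassed:

from numpy import zeros

# one label per sample drawn: index 0 is the positive example, the rest are negatives
model.neg_labels = zeros(model.negative + 1)  # negative=5 by default, so length 6
model.neg_labels[0] = 1.0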
Pushed this fix for the "slow" version.
Regarding the cythonized version... I'd need more time (and help).
Sure, but why would the index be out of the expected, functioning range? Often because of some (arbitrarily-)earlier memory-corruption.
@gojomo I received one more report of this problem; maybe we should raise an exception for this case (when update=True), because it keeps happening again and again (at least until we fix the bug itself).
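Until the underlying bug is fixed, a minimal sketch of such a guard could look like this (hypothetical wrapper, not an actual gensim patch):

from gensim.models.doc2vec import Doc2Vec

class GuardedDoc2Vec(Doc2Vec):
    """Fail fast with a clear error instead of segfaulting later."""

    def build_vocab(self, documents, update=False, **kwargs):
        if update:
            raise NotImplementedError(
                "vocabulary expansion (build_vocab(..., update=True)) is not "
                "supported for Doc2Vec yet; retrain on the combined corpus instead")
        return super(GuardedDoc2Vec, self).build_vocab(documents, update=update, **kwargs)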
Hi, any update on this issue?
I am able to train a doc2vec model with new documents in 32-bit Python (for 64-bit Python, it still crashes), but I cannot query model.docvecs.most_similar(["XXX"]) for newly added documents; it shows an index out of range error.
An online approach for doc2vec will be very helpful.
@khulasaandh as far as I know, you can use infer_vector for a new document and calculate the needed similarity values.
Hi @menshikh-iv , thanks for the reply.
I am using the same example posted by @danoneata, but have added a few more documents/lines to sentences_1 and sentences_2. As you mentioned, I am computing the inferred vector for the new document as shown below.
infer_vector = model.infer_vector(token_list)
print(model.docvecs.most_similar(positive=[infer_vector]))
It returns the most similar documents but gives nan values in place of the similarity coefficients: [('SENT_0', nan), ('SENT_1', nan), ('SENT_2', nan)]
Am i doing this wrong?
@khulasaandh looks really suspicious (your code is correct). Can you share data (trained model & token_list) for reproducing this error?
@khulasaandh @menshikh-iv A separate non-segfault anomaly with infer_vector() would be best diagnosed on the discussion list, or in a new issue dedicated to that specific problem.
Hi @menshikh-iv and @gojomo, even on the 32-bit Python that I am using, the segmentation fault still occurs sometimes, but most of the time the code runs.
My python version -
Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:38:48) [MSC v.1900 32 bit (Intel)] on win32
Please find the code below to replicate the issue.
import logging

from gensim.models.doc2vec import (
    Doc2Vec,
    TaggedDocument,
)

logging.basicConfig(
    format='%(asctime)s : %(threadName)s : %(levelname)s : %(message)s',
    level=logging.DEBUG,
)


def to_str(d):
    return ", ".join(d.keys())


SENTS = [
    "anecdotal using a personal experience or an isolated example instead of a sound argument or compelling evidence",
    "plausible thinking that just because something is plausible means that it is true",
    "occam razor is used as a heuristic technique discovery tool to guide scientists in the development of theoretical models rather than as an arbiter between published models",
    "karl popper argues that a preference for simple theories need not appeal to practical or aesthetic considerations",
    "the successful prediction of a stock future price could yield significant profit",
]
SENTS = [s.split() for s in SENTS]


def main():
    sentences_1 = [
        TaggedDocument(SENTS[0], tags=['SENT_0']),
        TaggedDocument(SENTS[1], tags=['SENT_1']),
        TaggedDocument(SENTS[2], tags=['SENT_2']),
    ]
    sentences_2 = [
        TaggedDocument(SENTS[3], tags=['SENT_3']),
        TaggedDocument(SENTS[4], tags=['SENT_4']),
    ]

    model = Doc2Vec(min_count=1, workers=4)

    model.build_vocab(sentences_1)
    model.train(sentences_1, total_examples=model.corpus_count, epochs=model.iter)
    print("-- Base model")
    print("Vocabulary:", to_str(model.wv.vocab))
    print("Tags:", to_str(model.docvecs.doctags))

    model.build_vocab(sentences_2, update=True)
    model.train(sentences_2, total_examples=model.corpus_count, epochs=model.iter)
    print("-- Updated model")
    print("Vocabulary:", to_str(model.wv.vocab))
    print("Tags:", to_str(model.docvecs.doctags))

    token_list = "the successful prediction of a stock future price could yield significant profit".split()
    infer_vector = model.infer_vector(token_list)
    print(model.docvecs.most_similar(positive=[infer_vector]))


if __name__ == '__main__':
    main()
Big thanks @khulasaandh, reproduced with Python 2.7.14 (default, Sep 23 2017, 22:06:14) [GCC 7.2.0] on linux2
Segfault moment
In [6]: model.train(sentences_2, total_examples=model.corpus_count, epochs=model.iter)
/home/ivan/.virtualenvs/math/bin/ipython:1: DeprecationWarning: Call to deprecated `iter` (Attribute will be removed in 4.0.0, use self.epochs instead).
#!/home/ivan/.virtualenvs/math/bin/python
2018-03-28 02:18:17,204 : MainThread : INFO : training model with 4 workers on 68 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2018-03-28 02:18:17,207 : Thread-79 : DEBUG : job loop exiting, total 1 jobs
Segmentation fault (core dumped)
Does anyone have a workaround until this gets fixed?
Hello,
I'm currently trying to get gensim to train a couple of TaggedDocument objects which originate from a non-static source of input data. Or to put it differently: I need to add unpredictable TaggedDocument objects to my doc2vec model on a regular basis. And - you might have guessed it - I ran into the same problem as you did.
So it's gensim 3.8.0 on Linux Debian Buster, 64-bit.
The workaround offered by nsfinkelstein didn't work at all (besides, I do not know the size of my dictionary), which is sad... and probably caused by my poor Python experience (about... two weeks?). But (!) I noticed something:
If you are about to add new content to your dictionary, it will go straight into a segmentation fault if done the way one would expect: put the new TaggedDocument into the model with model.build_vocab(documents=newTD, update=True) and then call model.train(newTD). But by implementing the workaround in a wrong way, I noticed that adding TaggedDocuments that are kind of identical to whatever is already present in the vocabulary won't trigger the segmentation fault.
here... look at these:
td1 = TaggedDocument(words=['1','2','3','4','5','6','7','8','9','10'], tags=[]),
td2 = TaggedDocument(words=['11','12','13','14','15'], tags=[]),
As you can see, the second one is a kind of logical extension of the first one. And as you might have observed, the dictionary adds one entry for every word, roughly in the order it is put in.
So after td1 has been added to the vocab, asking for the vocab will yield
'1','2','3','4','5','6','7','8','9','10'
Now one would tend to add td2, but this will cause the segmentation fault as soon as we call model.train(td2).
But if you do it this way:
td1 = TaggedDocument(words=['1','2','3','4','5','6','7','8','9','10'], tags=[]),
td2 = TaggedDocument(words=['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'], tags=[]),
you can actually train after adding td2 to the vocab.
It gets a bit harder when you need to insert words into the vocab:
td1 = TaggedDocument(words=['Im','very','confused','and','astonished','about','almost','all','and','everything'], tags=[])
td2 = TaggedDocument(words=['I','like','cats','and','docs'], tags=[])
td1's vocabulary representation would omit the second 'and', so it would look like this
'Im','very','confused','and','astonished','about','almost','all','everything'
If you want to repeat the effect from the numbers example I described, more work is needed. One needs to extract the existing vocabulary, add all words that are NOT already inside the vocab in the order they appear, and offer all of this as a new TaggedDocument:
td3 = TaggedDocument(words=['Im','very','confused','and','astonished','about','almost','all','everything','I','like','cats','dogs'], tags=[])
Offering this via build_vocab(td3, update=True) will allow you to train the existing model with td2.
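A rough sketch of how that merged document could be built programmatically (assuming a gensim 3.8 model whose known words live in model.wv.vocab / model.wv.index2word, and the td1/td2 examples above):

existing = list(model.wv.index2word)                         # words already in the vocab, in their stored order
unseen = [w for w in td2.words if w not in model.wv.vocab]   # new words, in order of appearance
td3 = TaggedDocument(words=existing + unseen, tags=[])

model.build_vocab([td3], update=True)  # expand the vocab via the merged document
model.train([td2], total_examples=1, epochs=model.epochs)    # words only, no tags: this no longer crashes here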
But... yes, there is always a but... while this does work with text (documents/words), as soon as you try to add tags to the whole thing, it goes back to segfaulting itself to death. Not even the "offer a special TaggedDocument" trick can solve this :(
And this brought me to a dead end, because I really need those tags... Any chance someone might find a solution for this?
Hello @korostelevm, I tried to run your code with gensim 4.1.2 and it failed. Perhaps you could share the environment you used to run this code?