
IndexError: Error calculating span: Can't find end

Open dbl001 opened this issue 7 years ago • 9 comments

Running on OS X 10.11.6

$ python --version
Python 2.7.11 :: Anaconda custom (x86_64)

$ python preprocess.py
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    merge=True)
  File "build/bdist.macosx-10.5-x86_64/egg/lda2vec/preprocess.py", line 78, in tokenize
    # Chop timestamps into days
  File "spacy/tokens/span.pyx", line 65, in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:3955)
  File "spacy/tokens/span.pyx", line 130, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:5105)
IndexError: Error calculating span: Can't find end

Related to: https://github.com/cemoody/lda2vec/issues/38

dbl001 avatar Apr 04 '17 18:04 dbl001


---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input> in <module>()
     45 texts = features.pop('comment_text').values
     46 tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4,
---> 47                                     merge=True)
     48 del texts
     49

/Users/davidlaxer/anaconda/lib/python2.7/site-packages/lda2vec-0.1-py2.7.egg/lda2vec/preprocess.pyc in tokenize(texts, max_length, skip, attr, merge, nlp, **kwargs)
     76     for phrase in doc.noun_chunks:
     77         # Only keep adjectives and nouns, e.g. "good ideas"
---> 78         while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
     79             phrase = phrase[1:]
     80         if len(phrase) > 1:

/Users/davidlaxer/anaconda/lib/python2.7/site-packages/spacy-1.7.3-py2.7-macosx-10.5-x86_64.egg/spacy/tokens/span.pyx in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:3955)()
     63
     64     def __len__(self):
---> 65         self._recalculate_indices()
     66         if self.end < self.start:
     67             return 0

/Users/davidlaxer/anaconda/lib/python2.7/site-packages/spacy-1.7.3-py2.7-macosx-10.5-x86_64.egg/spacy/tokens/span.pyx in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:5105)()
    128         end = token_by_end(self.doc.c, self.doc.length, self.end_char)
    129         if end == -1:
--> 130             raise IndexError("Error calculating span: Can't find end")
    131
    132         self.start = start

IndexError: Error calculating span: Can't find end
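Not spaCy itself, but a minimal plain-Python sketch of the failure mode the traceback points at: a Span stores character offsets and rebuilds its token indices by scanning for a token whose end offset matches the span's end_char (frame 130 above returns -1 when none matches). If merging noun chunks rewrites token boundaries, a stale span's end_char may no longer line up with any token. The `token_by_end` function and tuple tokens below are illustrative stand-ins, not the real API:

```python
# Stand-in for spaCy's token_by_end lookup (see frame 130 in the traceback):
# scan tokens for one whose end offset equals the span's end_char; -1 if none.
def token_by_end(tokens, end_char):
    for i, (start, end) in enumerate(tokens):
        if end == end_char:
            return i
    return -1

# "good ideas here" tokenized as (start_char, end_char) pairs
tokens = [(0, 4), (5, 10), (11, 15)]
assert token_by_end(tokens, 4) == 0    # a span ending at "good" resolves fine

# After merging "good ideas" into one token, the boundary at char 4 is gone:
merged = [(0, 10), (11, 15)]
assert token_by_end(merged, 4) == -1   # this is where spaCy raises IndexError
```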

dbl001 avatar Apr 04 '17 18:04 dbl001

Seems to work with merge=False:

tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4, merge=False)

preprocess.py: line 46

dbl001 avatar Apr 04 '17 19:04 dbl001

I've run into similar issues (or the same issue) where merge=False resolves things, but what impact does that have on the results besides squashing the error?

crawfordcomeaux avatar Apr 11 '17 22:04 crawfordcomeaux

The merge option merges noun chunks (nouns plus their modifiers) into single tokens. I don't think it affects the shape of the topics much, since LDA should be able to handle the individual words anyway.
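For reference, the loop visible in the traceback (lda2vec's tokenize) trims each noun chunk down to its adjective/compound-plus-noun core before merging it into one token. Here is a plain-Python sketch of that trimming, with (text, dep) tuples standing in for spaCy tokens; `BAD_DEPS` mirrors lda2vec's bad_deps and its contents are an assumption here:

```python
# Dependency labels kept at the head of a chunk (assumed value of bad_deps)
BAD_DEPS = ('amod', 'compound')

def trim_chunk(phrase):
    """Drop leading tokens (determiners, adverbs, ...) until the first
    adjective-modifier or compound remains, as in lda2vec's tokenize loop."""
    while len(phrase) > 1 and phrase[0][1] not in BAD_DEPS:
        phrase = phrase[1:]
    return phrase

# The noun chunk "the really good ideas" as (text, dep) pairs
chunk = [("the", "det"), ("really", "advmod"), ("good", "amod"), ("ideas", "ROOT")]
trimmed = trim_chunk(chunk)
print([text for text, dep in trimmed])  # ['good', 'ideas']
```

With merge=True the surviving pair would be fused into the single token "good ideas"; with merge=False every word stays a separate token, which is why the topics barely change.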

AdrianTudC avatar Jun 26 '17 14:06 AdrianTudC

I got the same issue. It can be solved by setting the merge option to False.

tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4,
                                    merge=False)  # changed from True to False

fivejjs avatar Oct 12 '17 03:10 fivejjs

Hi, I am trying with merge=False now. May I know roughly how long the tokenize function takes to run?

Cheers Arav

Aravinviju avatar Apr 25 '18 16:04 Aravinviju

Hi all

After I changed to merge=False, it gives me the following error:

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input> in <module>()
     45 texts = features.pop('comment_text').values
     46 tokens, vocab = preprocess.tokenize(texts, max_length, n_threads=4,
---> 47                                     merge=False)
     48 del texts
     49

/usr/local/lib/python2.7/dist-packages/lda2vec-0.1-py2.7.egg/lda2vec/preprocess.pyc in tokenize(texts, max_length, skip, attr, merge, nlp, **kwargs)
    104     data[row, :length] = dat[:length, 0].ravel()
    105     uniques = np.unique(data)
--> 106     vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    107     vocab[skip] = '<SKIP>'
    108     return data, vocab

/usr/local/lib/python2.7/dist-packages/lda2vec-0.1-py2.7.egg/lda2vec/preprocess.pyc in <dictcomp>((v,))
    104     data[row, :length] = dat[:length, 0].ravel()
    105     uniques = np.unique(data)
--> 106     vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    107     vocab[skip] = '<SKIP>'
    108     return data, vocab

vocab.pyx in spacy.vocab.Vocab.__getitem__()

OverflowError: can't convert negative value to uint64_t

Any heads-up on this? Kindly help me out.

cheers Arav

Aravinviju avatar Apr 26 '18 09:04 Aravinviju

You need to run python x64 and libs also on x64.
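That diagnosis fits the error: spaCy's vocab keys are unsigned 64-bit hashes, and on a 32-bit interpreter (or with 32-bit numpy in the pipeline) a large hash can wrap to a negative value before it reaches `Vocab.__getitem__`, which then refuses to cast it to uint64_t. A quick way to check your build's bitness, plus a toy demo of the wrap (the specific hash value is made up for illustration; assumes numpy is installed):

```python
import struct
import numpy as np

# Pointer size in bits tells you whether this interpreter is 32- or 64-bit.
bits = struct.calcsize("P") * 8
print("Python build: %d-bit" % bits)

# Toy demo of the failure: a value too big for int32 wraps negative when
# cast down, and a negative key cannot be converted back to uint64_t.
big_hash = np.int64(2**31 + 5)          # larger than int32 can represent
wrapped = big_hash.astype(np.int32)     # C-style cast: wraps around
print(int(wrapped))                     # -2147483643, i.e. negative
```

On a proper 64-bit Python with 64-bit libraries the hashes stay in uint64 range end to end, which is why switching to x64 makes the OverflowError go away.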

AdrianTudC avatar Apr 28 '18 05:04 AdrianTudC

Quoting Aravinviju above:

> After I changed the 'merge = false', it is giving me the following error,
> OverflowError: can't convert negative value to uint64_t

I'm getting this error too when I try to run preprocess.py. How can I fix it?

fathia-ghribi avatar Feb 18 '21 11:02 fathia-ghribi