
error with spacy

openmotion opened this issue Jun 24 '16 • 22 comments

Hello, I get this error when running python preprocess.py in hacker_news/data:

Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    merge=True)
  File "build/bdist.linux-x86_64/egg/lda2vec/preprocess.py", line 76, in tokenize
    author_name = authors.categories
  File "spacy/tokens/doc.pyx", line 250, in noun_chunks (spacy/tokens/doc.cpp:8013)
  File "spacy/syntax/iterators.pyx", line 11, in english_noun_chunks (spacy/syntax/iterators.cpp:1559)
  File "spacy/tokens/doc.pyx", line 100, in spacy.tokens.doc.Doc.getitem (spacy/tokens/doc.cpp:4890)
IndexError: list index out of range

openmotion avatar Jun 24 '16 14:06 openmotion

Probably this error? If so, I think updating spaCy should fix it.

https://github.com/spacy-io/spaCy/issues/375

tokestermw avatar Jun 24 '16 18:06 tokestermw

The problem is indeed in spaCy. As suggested there, the workaround is to write

    for phrase in list(doc.noun_chunks):

instead of

    for phrase in doc.noun_chunks:

The in-place merge() invalidates the iterator.
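
In context, the patched section of lda2vec's tokenize would look roughly like this (a sketch using the spaCy 1.x Span.merge signature that appears later in this thread):

for phrase in list(doc.noun_chunks):
    # Trim leading tokens whose dependency isn't amod/compound,
    # so only phrases like "good ideas" survive.
    while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
        phrase = phrase[1:]
    if len(phrase) > 1:
        # Merge the remaining tokens into a single token, e.g. good_ideas
        phrase.merge(phrase.root.tag_, phrase.text, phrase.root.ent_type_)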

rchari51 avatar Aug 08 '16 18:08 rchari51

@rchari51 This is what I get after manually making the changes to spacy/tokens/doc.pyx and lda2vec/preprocess.py:

Traceback (most recent call last):
  File "data/preprocess.py", line 47, in <module>
    merge=True)
  File "/home/ubuntu/lda2vec/lda2vec/preprocess.py", line 78, in tokenize
    while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
  File "spacy/tokens/span.pyx", line 54, in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:3817)
  File "spacy/tokens/span.pyx", line 97, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:4975)
IndexError: Error calculating span: Can't find end

crawfordcomeaux avatar Sep 22 '16 20:09 crawfordcomeaux

@crawfordcomeaux - Were you able to resolve your issue. I ran into the same thing

saravp avatar Oct 19 '16 02:10 saravp

@saravp Only with merge=False, which doesn't really fit my use case.

crawfordcomeaux avatar Oct 19 '16 03:10 crawfordcomeaux

@saravp I just took a look at spaCy's issues to see if anything related to this stood out, and they just shipped version 1.0. Does anything change if you update spaCy?

crawfordcomeaux avatar Oct 19 '16 03:10 crawfordcomeaux

I'm seeing this error as well, using spaCy from master (commit d8db648ebf70e4bddfe21cad50a34891e4b75154):

File "data/preprocess.py", line 47, in <module>
merge=True)
File "/Users/grivescorbett/projects/lda2vec/lda2vec/preprocess.py", line 78, in tokenize
while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
File "spacy/tokens/span.pyx", line 65, in spacy.tokens.span.Span.__len__ (spacy/tokens/span.cpp:4142)
File "spacy/tokens/span.pyx", line 130, in spacy.tokens.span.Span._recalculate_indices (spacy/tokens/span.cpp:5339)

grivescorbett avatar Nov 01 '16 16:11 grivescorbett

@grivescorbett @crawfordcomeaux @openmotion @saravp @cemoody Could this be an indentation issue? I think the code at https://github.com/cemoody/lda2vec/blob/master/lda2vec/preprocess.py#L85-L88 should be unindented by one level; with that change the error disappears. I think the inner loop is mutating the spans, and that's what triggers this error. Honnibal's current code does seem to have the unindent (see https://github.com/explosion/sense2vec/blob/master/bin/merge_text.py#L95-L96).
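
To make the change concrete, here is the nesting before and after (reconstructed from the linked lines; merge() uses the spaCy 1.x signature):

# Before: the merge runs inside the while loop, mutating the span
# while it is still being trimmed.
for phrase in doc.noun_chunks:
    while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
        phrase = phrase[1:]
        if len(phrase) > 1:
            phrase.merge(phrase.root.tag_, phrase.text, phrase.root.ent_type_)

# After: unindented by one level, the merge happens once per chunk,
# only after trimming is finished.
for phrase in doc.noun_chunks:
    while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
        phrase = phrase[1:]
    if len(phrase) > 1:
        phrase.merge(phrase.root.tag_, phrase.text, phrase.root.ent_type_)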

To reproduce the error:

s = u"""Marijuana is not the gateway drug alcohol is. I was introduced to alcohol at age of ten. I was introduced to marijuana at age of 14 . I was introduced to cocaine and crack at the age 17 & 18 . upon being introduced to crack I became addicted to crack & left marijuana alone."""
tokenize([s], max_length, skip=-2, attr=LOWER, merge=True, nlp=None)

this works for me now :)

# spaCy 1.x imports (this snippet predates the spaCy 2.x API)
import numpy as np
from spacy.en import English
from spacy.attrs import LOWER, LIKE_EMAIL, LIKE_URL


def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
             **kwargs):
    """Convert texts into a (n_texts, max_length) int32 array of token
    attribute IDs, plus a vocab dict mapping IDs back to strings."""
    if nlp is None:
        nlp = English()
    data = np.zeros((len(texts), max_length), dtype='int32')
    data[:] = skip
    bad_deps = ('amod', 'compound')
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:
            # from the spaCy blog, an example on how to merge
            # noun phrases into single tokens
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
            # Iterate over named entities
            for ent in doc.ents:
                if len(ent) > 1:
                    # Merge them into single tokens
                    ent.merge(ent.root.tag_, ent.text, ent.label_)
        dat = doc.to_array([attr, LIKE_EMAIL, LIKE_URL]).astype('int32')
        if len(dat) > 0:
            msg = "Negative indices reserved for special tokens"
            assert dat.min() >= 0, msg
            # Replace email and URL tokens with the skip marker
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab
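
A quick sanity check of the return values, reusing the example sentence s from above (the max_length of 50 is an arbitrary assumption):

data, vocab = tokenize([s], max_length=50, skip=-2, attr=LOWER, merge=True)
print(data.shape)  # (1, 50); unused positions hold -2, the skip marker
print(vocab[-2])   # '<SKIP>'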

NilsRethmeier avatar Dec 20 '16 14:12 NilsRethmeier

I was having this issue in Python 2.7 and tried the above fixes. Unfortunately, none of them solved the problem. I ended up trying it in Python 3.5 and it worked. It's definitely an issue with the conversion of those huge uint64 values to int32.
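
The overflow is easy to demonstrate with numpy alone (the values below are made up; spaCy's attribute IDs are uint64 hashes):

import numpy as np

# Any ID at or above 2**31 wraps negative when cast to int32, which is
# what trips the "Negative indices reserved for special tokens" assert
# in tokenize().
ids = np.array([2**31, 12345], dtype=np.uint64)
print(ids.astype('int32'))  # [-2147483648, 12345]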

spnichol avatar Dec 23 '17 22:12 spnichol

Can you explain what you tried to make it work in Python 3? I am still getting negative indices with both fixes.

bountrisv avatar Jan 19 '18 17:01 bountrisv

File "/home/aum/PycharmProjects/learn_p/venv/src/lda2vec/lda2vec/preprocess.py", line 35, in tokenize assert dat.min() >= 0, msg AssertionError: Negative indices reserved for special tokens

What should I do?

hirenaum97 avatar Jan 25 '18 05:01 hirenaum97

@hirenaum97 Hi, were you able to resolve the error? I got a similar one. Thanks.

gracegcy avatar Jan 29 '18 20:01 gracegcy

Just change your spaCy version to 1.9 (e.g. pip install spacy==1.9.0).

hirenaum97 avatar Jan 30 '18 05:01 hirenaum97

Thanks to @hirenaum97, I've changed my spaCy version to 1.9 and followed @NilsRethmeier's advice to unindent the code at https://github.com/cemoody/lda2vec/blob/master/lda2vec/preprocess.py#L85-L88 by one level. That solved the problem on py2.

CoreJa avatar May 06 '18 08:05 CoreJa

I have the error

numpy.core._internal.AxisError: axis -1 is out of bounds for array of dimension 0

caused by the line in corpus.py

specials = np.sort(self.specials.values())

which led to the error in this line in corpus.py

self.keys_loose, self.keys_counts, n_keys = self._loose_keys_ordered()

which then led to the error above when I run preprocess.py (in the data folder), at the line

corpus.finalize()

Does anyone have any idea how to solve this? Thanks a lot!
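
For context, this AxisError reproduces with plain numpy on Python 3, where dict.values() returns a view rather than a list (the specials mapping below is a made-up stand-in for the corpus's specials):

import numpy as np

specials = {'<SKIP>': -2, '<OOV>': -1}

# np.asarray over a dict view yields a 0-d object array, so sorting
# along axis -1 raises the AxisError above.
try:
    np.sort(specials.values())
except Exception as e:
    print(type(e).__name__, e)  # AxisError: axis -1 is out of bounds ...

# Wrapping the view in list() restores the Python 2 behaviour.
print(np.sort(list(specials.values())))  # [-2 -1]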

lovedatatiff avatar May 11 '18 14:05 lovedatatiff

@Core00077 did you manage to run run.py successfully? I'm trying to reproduce the run for twentynews and am running into quite a few issues here.

lovedatatiff avatar May 11 '18 14:05 lovedatatiff

@lovedatatiff Sorry, I meant I've run preprocess.py successfully. Actually, my work PC doesn't have an Nvidia GPU, so I couldn't run the whole project, but I think I did it the right way. Since this project hasn't been maintained for a while, it's better to use lower-version dependencies like spacy==1.9.0.

If possible, I'll message you later once I've run it successfully.

CoreJa avatar May 14 '18 05:05 CoreJa

@Core00077 Hey, that'd be amazing, thank you! I don't have an Nvidia GPU either, and I'm not quite sure how to work around this since I'm quite new to this - would you mind sharing your code with me if you've successfully run it? Looking forward to your reply! :)

lovedatatiff avatar May 14 '18 08:05 lovedatatiff

@lovedatatiff Sure thing. But an Nvidia GPU is NECESSARY if you want to run it, since it imports cupy. I have a notebook with an Nvidia GPU, but I don't have time right now; I'm working on my exams. I'll let you know when I've made some progress on this.

CoreJa avatar May 14 '18 12:05 CoreJa

That's amazing! I have an AWS deep learning GPU instance running Ubuntu - would that work? Would you also mind sharing your notebook with me and I'll see if I can make it work?

lovedatatiff avatar May 14 '18 13:05 lovedatatiff

Sorry, my notebook doesn't have a public IP address. I'm not sure whether the AWS GPU would work or not, but it should basically work.

CoreJa avatar May 14 '18 13:05 CoreJa

Hi everyone,

Did anybody solve this issue? I have spacy==2.0.5 and I'm still getting this problem.

LizaKoz avatar Jun 27 '18 09:06 LizaKoz