LdaModel constructor is missing a dimensionality check

Open sophros opened this issue 3 years ago • 2 comments

Problem description

When creating a new LdaModel and passing a malformed corpus, you can end up with an error like the one in this SO question:

/usr/local/lib/python3.7/dist-packages/gensim/models/ldamodel.py in inference(self, chunk, collect_sstats)
    651         # to Blei's original LDA-C code, cool!).
    652         for d, doc in enumerate(chunk):
--> 653             if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
    654                 # make sure the term IDs are ints, otherwise np will get upset
    655                 ids = [int(idx) for idx, _ in doc]

TypeError: 'int' object is not subscriptable

What is the expected result? A more coherent error message.

What are you seeing instead? TypeError: 'int' object is not subscriptable

Steps/code/corpus to reproduce

The details are available in this SO question.

The suggestion is to change the condition from:

if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):

to

if len(doc) > 0 and len(doc[0]) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
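
One caveat worth checking before adopting this condition, sketched here in plain Python with no Gensim dependency. The traceback implies that len(doc) succeeded while doc[0][0] failed, so doc[0] is itself an int, and calling len() on an int also raises a TypeError. The function names and sample documents below are made up for illustration, and isinstance(..., int) stands in for the six.integer_types + (np.integer,) check.

```python
# Illustration only: plain Python, no Gensim or NumPy required.

# A well-formed document is a list of (term_id, count) pairs.
good_doc = [(0, 1), (3, 2)]

# One way to get the reported error (an assumption about the malformed input):
# a single (term_id, count) pair is treated as if it were a whole document.
bad_doc = (0, 1)


def original_condition(doc):
    # Simplified form of the condition currently in inference().
    return len(doc) > 0 and not isinstance(doc[0][0], int)


def proposed_condition(doc):
    # Simplified form of the condition suggested in this issue.
    return len(doc) > 0 and len(doc[0]) > 0 and not isinstance(doc[0][0], int)


def error_of(check, doc):
    """Return the TypeError message raised by check(doc), or None if it passes."""
    try:
        check(doc)
    except TypeError as err:
        return str(err)
    return None


print(error_of(original_condition, good_doc))  # None
print(error_of(original_condition, bad_doc))   # 'int' object is not subscriptable
print(error_of(proposed_condition, bad_doc))   # object of type 'int' has no len()
```

If the malformed input looks like bad_doc, the extra clause would swap one confusing TypeError for another rather than produce a clearer message.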

sophros, Feb 23 '22 13:02

That doesn't seem to be the right place for an input format check. If your corpus is malformed, then even if you skip the "force-term-ids-to-int" code path, the code will blow up somewhere else a little bit later.

I'd be in favour of adding a sanity check with a nice, user-friendly error message (such as checking the format of the first document for common input errors). But I think such a check belongs higher up the stack, close to the first place where Gensim receives the malformed input from the user, not in an inner loop in inference.
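
As a rough sketch of what such a first-document check might look like: the helper name check_first_document, its error messages, and the point where it would be called are all hypothetical, not existing Gensim API. It assumes the corpus can be iterated more than once (a one-shot generator would lose its first document here).

```python
from numbers import Real


def check_first_document(corpus):
    """Hypothetical helper (not part of Gensim): validate the first document.

    Expects each document to be an iterable of (term_id, weight) pairs and
    raises ValueError with a readable message otherwise.
    """
    first = next(iter(corpus), None)
    if first is None:
        return  # empty corpus: nothing to check

    try:
        items = list(first)
    except TypeError:
        raise ValueError(
            "Each document must be a sequence of (term_id, weight) pairs, "
            f"but the first document is {first!r}."
        ) from None

    for item in items:
        try:
            term_id, weight = item
        except (TypeError, ValueError):
            raise ValueError(
                "Each document must be a sequence of (term_id, weight) pairs, "
                f"but the first document contains {item!r}. "
                "Did you pass a single document instead of a list of documents, "
                "or raw tokens instead of bag-of-words vectors?"
            ) from None
        if not isinstance(term_id, Real) or not isinstance(weight, Real):
            raise ValueError(
                f"Term ids and weights must be numbers, got {item!r}."
            )


# Usage: a well-formed corpus passes silently, a malformed one fails early.
check_first_document([[(0, 1), (3, 2)]])
try:
    check_first_document([(0, 1), (3, 2)])  # a single document passed as the corpus
except ValueError as err:
    print(err)
```

Only the first document is inspected, so the cost stays constant regardless of corpus size; the trade-off is that malformed documents later in the corpus would still go undetected.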

piskvorky, Feb 23 '22 18:02

Any solution to this?

Ishmam97, Jan 20 '23 16:01