gensim
LdaModel constructor is missing a dimensionality check
Problem description
When creating a new LdaModel and passing a malformed corpus, you can end up with an error like the one in this SO question:
/usr/local/lib/python3.7/dist-packages/gensim/models/ldamodel.py in inference(self, chunk, collect_sstats)
651 # to Blei's original LDA-C code, cool!).
652 for d, doc in enumerate(chunk):
--> 653 if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
654 # make sure the term IDs are ints, otherwise np will get upset
655 ids = [int(idx) for idx, _ in doc]
TypeError: 'int' object is not subscriptable
What is the expected result? A more coherent error message.
What are you seeing instead? TypeError: 'int' object is not subscriptable
Steps/code/corpus to reproduce
The details are available in this SO question.
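For readers without access to the linked question, here is a minimal sketch of how that condition can raise this error. It assumes the malformed corpus contains documents that are flat lists of term ids rather than lists of (term_id, count) pairs; the actual input in the SO question may differ. `needs_int_conversion` is a hypothetical stand-in for the check in `inference`, and plain `int` replaces `six.integer_types + (np.integer,)` so the snippet runs without extra dependencies.

```python
# Hypothetical stand-in for the condition in LdaModel.inference.
# Assumption: plain `int` replaces six.integer_types + (np.integer,).
def needs_int_conversion(doc):
    return len(doc) > 0 and not isinstance(doc[0][0], int)


well_formed_doc = [(0, 2), (3, 1)]  # bag-of-words: (term_id, count) pairs
malformed_doc = [0, 3, 3]           # assumed malformation: bare term ids

# The first term id is already an int, so no conversion is needed.
print(needs_int_conversion(well_formed_doc))  # False

try:
    needs_int_conversion(malformed_doc)
except TypeError as err:
    # doc[0] is an int here, so doc[0][0] fails.
    error_message = str(err)
    print(error_message)
```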
The suggestion is to change the condition from:
if len(doc) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
to
if len(doc) > 0 and len(doc[0]) > 0 and not isinstance(doc[0][0], six.integer_types + (np.integer,)):
That doesn't seem to be the right place for an input format check. If your corpus is malformed, then even if you skip the "force-term-ids-to-int" code path, the code will blow up somewhere else a little bit later.
I'd be in favour of adding a sanity check with a nice user-friendly error message (such as checking the format of the first document for common input errors). But I think such a check belongs higher up the stack, close to the first place where Gensim receives the malformed input from the user, not in an inner loop in inference.
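A rough sketch of what such a first-document check could look like. `check_first_document` is a hypothetical helper, not an existing Gensim function, and the error wording is only illustrative. It assumes the corpus can be iterated more than once, since peeking at the first document would consume it from a one-shot generator.

```python
_MISSING = object()


def check_first_document(corpus):
    """Hypothetical helper (not part of Gensim): fail early on a malformed corpus."""
    # Assumption: `corpus` is re-iterable; a one-shot generator would lose its first document.
    first_doc = next(iter(corpus), _MISSING)
    if first_doc is _MISSING:
        return  # empty corpus: nothing to check

    hint = "expected each document to be a list of (term_id, count) pairs"
    try:
        entries = list(first_doc)
    except TypeError:
        raise ValueError(f"{hint}, but the first document is {first_doc!r}") from None

    for entry in entries:
        # Strings unpack into characters, so reject them explicitly
        # (this catches tokenized text passed in place of a bag-of-words).
        if isinstance(entry, (str, bytes)):
            raise ValueError(f"{hint}, but found the string {entry!r} in the first document")
        try:
            term_id, count = entry
            int(term_id)   # term ids must be convertible to int
            float(count)   # counts or weights must be numeric
        except (TypeError, ValueError):
            raise ValueError(f"{hint}, but found {entry!r} in the first document") from None


check_first_document([[(0, 2), (3, 1)]])  # well-formed: passes silently

try:
    check_first_document([[0, 3, 3]])  # bare term ids instead of pairs
except ValueError as err:
    print(err)
```

Where exactly such a check would be called from is left open here; the point is only that it runs once on the user's input rather than for every document in the inference loop.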
Any solution to this?