Vocab Error when running 'Interpreting Classical Text Classification models' Notebook
Using a clean Python 3.7 environment on Ubuntu with interpret-text installed via pip, I am hitting an error when I try to walk through the 'Interpreting Classical Text Classification models' notebook; I have made no changes to the code.
When attempting to fit the model on the line:
classifier, best_params = explainer.fit(X_train, y_train)
I get the following error:
/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
warnings.warn("The parameter 'token_pattern' will not be used"
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-47f4fc43855d> in <module>
----> 1 classifier, best_params = explainer.fit(X_train, y_train)
/anaconda/envs/interpret/lib/python3.7/site-packages/interpret_text/experimental/classical.py in fit(self, X_str, y_train)
92 :rtype: list
93 """
---> 94 X_train = self._encode(X_str)
95 if self.is_trained is False:
96 if self.model is None:
/anaconda/envs/interpret/lib/python3.7/site-packages/interpret_text/experimental/classical.py in _encode(self, X_str)
61 :rtype: array_like (ndarray, pandas dataframe). Same rows as X_str
62 """
---> 63 X_vec, _ = self.preprocessor.encode_features(X_str)
64 return X_vec
65
/anaconda/envs/interpret/lib/python3.7/site-packages/interpret_text/experimental/common/utils_classical.py in encode_features(self, X_str, needs_fit, keep_ids)
129 # needs_fit will be set to true if encoder is not already trained
130 if needs_fit is True:
--> 131 self.vectorizer.fit(X_str)
132 if isinstance(X_str, str):
133 X_str = [X_str]
/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit(self, raw_documents, y)
1167 """
1168 self._warn_for_unused_params()
-> 1169 self.fit_transform(raw_documents)
1170 return self
1171
/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
1201
1202 vocabulary, X = self._count_vocab(raw_documents,
-> 1203 self.fixed_vocabulary_)
1204
1205 if self.binary:
/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
1131 vocabulary = dict(vocabulary)
1132 if not vocabulary:
-> 1133 raise ValueError("empty vocabulary; perhaps the documents only"
1134 " contain stop words")
1135
ValueError: empty vocabulary; perhaps the documents only contain stop words
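From the traceback, the ValueError comes from CountVectorizer building an empty vocabulary, which happens whenever the tokenizer it is given returns no tokens for any document. Here is a minimal sketch of that failure mode (the tokenizer and sample text below are hypothetical, not the notebook's actual configuration):

# Hypothetical reproduction: CountVectorizer raises "empty vocabulary"
# whenever the supplied tokenizer yields no tokens for any document.
from sklearn.feature_extraction.text import CountVectorizer

def broken_tokenizer(doc):
    return []  # stands in for a tokenizer that silently fails

vectorizer = CountVectorizer(tokenizer=broken_tokenizer)
vectorizer.fit(["This is a perfectly ordinary document."])
# ValueError: empty vocabulary; perhaps the documents only contain stop words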
Am I missing something obvious here?
Hi @Chris-hughes10, I was having the same problem, and it turned out to be related to the environment. What libraries and versions are you using?
I had a similar issue; installing an older version of the spacy package (2.3.7) from PyPI fixed it. It looks like the tokenizer code needs to be updated for the latest spacy.
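To double-check the fix after pinning with pip install spacy==2.3.7, you can confirm the installed version and that tokenization still produces tokens. This sketch uses plain spacy rather than interpret-text's own preprocessor, and the sample sentence is arbitrary:

# Sanity check after pinning spacy: confirm the installed version and
# that tokenization returns a non-empty token list.
import spacy

print(spacy.__version__)  # expect 2.3.7 after the pin

nlp = spacy.blank("en")   # tokenizer-only pipeline, no model download needed
tokens = [t.text for t in nlp("a quick sanity check")]
print(tokens)             # expect ['a', 'quick', 'sanity', 'check']
assert tokens, "tokenizer returned no tokens; vocabulary would be empty"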
see related issue: https://github.com/interpretml/interpret-text/issues/182