
Vocab Error when running 'Interpreting Classical Text Classification models' Notebook

Chris-hughes10 opened this issue 4 years ago · 3 comments

Using a clean Python 3.7 environment on Ubuntu, with interpret-text installed via pip, I hit an error while walking through the 'Interpreting Classical Text Classification models' notebook; I have made no changes to the code.

When attempting to fit the model, on the line:

classifier, best_params = explainer.fit(X_train, y_train)

I get the following error:

/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py:489: UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None'
  warnings.warn("The parameter 'token_pattern' will not be used"
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-47f4fc43855d> in <module>
----> 1 classifier, best_params = explainer.fit(X_train, y_train)

/anaconda/envs/interpret/lib/python3.7/site-packages/interpret_text/experimental/classical.py in fit(self, X_str, y_train)
     92         :rtype: list
     93         """
---> 94         X_train = self._encode(X_str)
     95         if self.is_trained is False:
     96             if self.model is None:

/anaconda/envs/interpret/lib/python3.7/site-packages/interpret_text/experimental/classical.py in _encode(self, X_str)
     61         :rtype: array_like (ndarray, pandas dataframe). Same rows as X_str
     62         """
---> 63         X_vec, _ = self.preprocessor.encode_features(X_str)
     64         return X_vec
     65 

/anaconda/envs/interpret/lib/python3.7/site-packages/interpret_text/experimental/common/utils_classical.py in encode_features(self, X_str, needs_fit, keep_ids)
    129         # needs_fit will be set to true if encoder is not already trained
    130         if needs_fit is True:
--> 131             self.vectorizer.fit(X_str)
    132         if isinstance(X_str, str):
    133             X_str = [X_str]

/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit(self, raw_documents, y)
   1167         """
   1168         self._warn_for_unused_params()
-> 1169         self.fit_transform(raw_documents)
   1170         return self
   1171 

/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
   1201 
   1202         vocabulary, X = self._count_vocab(raw_documents,
-> 1203                                           self.fixed_vocabulary_)
   1204 
   1205         if self.binary:

/anaconda/envs/interpret/lib/python3.7/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
   1131             vocabulary = dict(vocabulary)
   1132             if not vocabulary:
-> 1133                 raise ValueError("empty vocabulary; perhaps the documents only"
   1134                                  " contain stop words")
   1135 

ValueError: empty vocabulary; perhaps the documents only contain stop words

Am I missing something obvious here?
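For reference, a minimal standalone sketch (not the notebook's code) of how this exact `ValueError` arises: if the custom tokenizer passed to scikit-learn's `CountVectorizer` yields no tokens for any document (for example, because of an incompatible spacy version), the fitted vocabulary is empty and `fit` raises. The `broken_tokenizer` name here is hypothetical, purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer


def broken_tokenizer(doc):
    # Simulates a tokenizer that returns no tokens for every document,
    # which is one way to end up with an empty vocabulary.
    return []


vec = CountVectorizer(tokenizer=broken_tokenizer)
try:
    vec.fit(["the quick brown fox", "jumps over the lazy dog"])
except ValueError as e:
    print(e)  # empty vocabulary; perhaps the documents only contain stop words
```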

Chris-hughes10 avatar Jul 02 '21 14:07 Chris-hughes10

Hi @Chris-hughes10, I was having the same problem, and it turned out to be an issue with the environment. What libraries and versions are you using?

RitaDS avatar Oct 01 '21 10:10 RitaDS

I had a similar issue; installing an older version of the spacy package from PyPI (2.3.7) fixed it. It looks like the tokenizer code needs to be updated for the latest spacy.
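In case it helps, the pin described above can be applied like this (assuming a pip-managed environment; the version number comes from this comment, not from interpret-text's own requirements):

```shell
# Pin spacy to the last 2.x release mentioned above
pip install "spacy==2.3.7"
```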

imatiach-msft avatar Feb 02 '22 14:02 imatiach-msft

see related issue: https://github.com/interpretml/interpret-text/issues/182

imatiach-msft avatar Feb 02 '22 14:02 imatiach-msft