
Support for sparse arrays

Open akzaidi opened this issue 7 years ago • 7 comments

Highly intrigued by this package! Thanks for releasing it :)

Does skorch have support for sparse arrays? For example, if I run the following pipeline:

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('net', net)
])
pipe.fit(X_train, y_train)

where X_train is a numpy array of strings and y_train is a numpy array of integer labels, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-1d40af031c7c> in <module>()
----> 1 pipe.fit(X_train, y_train)

/anaconda/envs/skorch/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    257         Xt, fit_params = self._fit(X, y, **fit_params)
    258         if self._final_estimator is not None:
--> 259             self._final_estimator.fit(Xt, y, **fit_params)
    260         return self
    261 

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit(self, X, y, **fit_params)
   1170         # this is actually a pylint bug:
   1171         # https://github.com/PyCQA/pylint/issues/1085
-> 1172         return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
   1173 
   1174 

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit(self, X, y, **fit_params)
    630             self.initialize()
    631 
--> 632         self.partial_fit(X, y, **fit_params)
    633         return self
    634 

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in partial_fit(self, X, y, classes, **fit_params)
    595         self.notify('on_train_begin')
    596         try:
--> 597             self.fit_loop(X, y)
    598         except KeyboardInterrupt:
    599             pass

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit_loop(self, X, y, epochs)
    521 
    522         if self.train_split:
--> 523             X_train, X_valid, y_train, y_valid = self.train_split(X, y)
    524             dataset_valid = self.get_dataset(X_valid, y_valid)
    525         else:

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/dataset.py in __call__(self, X, y)
    308 
    309         # pylint: disable=invalid-name
--> 310         len_X = get_len(X)
    311         if y is not None:
    312             len_y = get_len(y)

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/dataset.py in get_len(data)
     39 
     40 def get_len(data):
---> 41     lens = [_apply_to_data(data, len, unpack_dict=True)]
     42     lens = list(flatten(lens))
     43     len_set = set(lens)

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/dataset.py in _apply_to_data(data, func, unpack_dict)
     35         except TypeError:
     36             return func(data)
---> 37     return func(data)
     38 
     39 

/anaconda/envs/skorch/lib/python3.5/site-packages/scipy/sparse/base.py in __len__(self)
    264     # non-zeros is more important.  For now, raise an exception!
    265     def __len__(self):
--> 266         raise TypeError("sparse matrix length is ambiguous; use getnnz()"
    267                         " or shape[0]")
    268 

TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]

print(platform.platform())
print("Python:", sys.version)
print("PyTorch:", torch.__version__)
print("Skorch:", skorch.__version__)
print("Scikit-Learn:", sklearn.__version__)

Linux-4.4.0-91-generic-x86_64-with-debian-stretch-sid
Python: 3.5.3 |Anaconda custom (64-bit)| (default, Mar  6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
PyTorch: 0.3.0.post4
Skorch: 0.1.0.post1
Scikit-Learn: 0.19.0

Do I need to wrap the output of the transformer preceding the NeuralNetClassifier in a FunctionTransformer and convert it to a dense array first?

akzaidi avatar Dec 13 '17 04:12 akzaidi

Good catch, we haven't tried out sparse matrices yet. Unfortunately, they won't work, for a number of reasons. I tried to make them work on the skorch side, but then pytorch causes trouble (no indexing, no .size on sparse tensors).

As soon as pytorch's sparse tensors are better supported, we should make sure skorch works with them; currently, that would require too many workarounds. Therefore, as you suggested, your best bet is to cast to a dense array:

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=MAX_FEATURES)),  # better set a limit
    ('tofloat32', FunctionTransformer(lambda x: x.astype(np.float32), accept_sparse=True)),
    ('toarray', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
    ('net', net),
])
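
A side note on the FunctionTransformer steps: lambdas cannot be pickled, which matters if you later want to save the fitted pipeline to disk or ship it to worker processes in a grid search. A minimal variant of the same pipeline using module-level functions instead (the helper names here are just illustrative):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

def to_float32(X):
    # cast the sparse tf-idf output to float32, pytorch's default float dtype
    return X.astype(np.float32)

def to_dense(X):
    # densify; with a large vocabulary this can use a lot of memory,
    # which is why limiting max_features above is a good idea
    return X.toarray()

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=MAX_FEATURES)),
    ('tofloat32', FunctionTransformer(to_float32, accept_sparse=True)),
    ('toarray', FunctionTransformer(to_dense, accept_sparse=True)),
    ('net', net),
])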

PS: I pushed some changes on the skorch side to make sparse matrices work: https://github.com/dnouri/skorch/tree/feature/sparse-support. As mentioned, it still won't work, though, because sparse tensors don't support some vital features (yet).

benjamin-work avatar Dec 13 '17 09:12 benjamin-work

Ah okay, thanks for the info!

akzaidi avatar Dec 13 '17 20:12 akzaidi

Sorry to bug you again!

This is confusing me:

class LSTMClassifier(nn.Module):

    def __init__(self, 
                 D_in=MAX_FEATURES,
                 EMBED_SIZE=150,
                 H=10, D_out=2):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = H
        self.word_embeddings = nn.Embedding(D_in, EMBED_SIZE)
        self.lstm = nn.LSTM(EMBED_SIZE, H)
        self.hidden2label = nn.Linear(H, D_out)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # the first is the hidden h
        # the second is the cell  c
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        x = embeds.view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        y  = self.hidden2label(lstm_out[-1])
        log_probs = F.log_softmax(y)
        return log_probs

lstmnet = NeuralNetClassifier(
        LSTMClassifier,
        max_epochs=100,
        lr=0.02,
        use_cuda=False, warm_start=False)

lstm = Pipeline([
    ('tokenizer', CountVectorizer(max_features=MAX_FEATURES)),
    ('toint', FunctionTransformer(lambda x: x.astype(np.int32), accept_sparse=True)),
    ('toarray', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
    ('net', lstmnet)
])

lstm.fit(X_train, y_train)
  • Why does the module re-initialize despite warm_start = False?
  • Do I need to cast the index to long? If so, how do I do so?
Re-initializing module!
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-4b32b563fee8> in <module>()
----> 1 lstm.fit(X_train, y_train)

/anaconda/envs/skorch/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    257         Xt, fit_params = self._fit(X, y, **fit_params)
    258         if self._final_estimator is not None:
--> 259             self._final_estimator.fit(Xt, y, **fit_params)
    260         return self
    261 

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit(self, X, y, **fit_params)
   1170         # this is actually a pylint bug:
   1171         # https://github.com/PyCQA/pylint/issues/1085
-> 1172         return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
   1173 
   1174 

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit(self, X, y, **fit_params)
    630             self.initialize()
    631 
--> 632         self.partial_fit(X, y, **fit_params)
    633         return self
    634 

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in partial_fit(self, X, y, classes, **fit_params)
    595         self.notify('on_train_begin')
    596         try:
--> 597             self.fit_loop(X, y)
    598         except KeyboardInterrupt:
    599             pass

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit_loop(self, X, y, epochs)
    540             for Xi, yi in self.get_iterator(dataset_train, training=True):
    541                 self.notify('on_batch_begin', X=Xi, y=yi, training=True)
--> 542                 loss = self.train_step(Xi, yi)
    543                 self.history.record_batch('train_loss', loss.data[0])
    544                 self.history.record_batch('train_batch_size', len(Xi))

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in train_step(self, Xi, yi)
    462         self.module_.train()
    463         self.optimizer_.zero_grad()
--> 464         y_pred = self.infer(Xi)
    465         loss = self.get_loss(y_pred, yi, X=Xi, training=True)
    466         loss.backward()

/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in infer(self, x)
    701         if isinstance(x, dict):
    702             return self.module_(**x)
--> 703         return self.module_(x)
    704 
    705     def predict_proba(self, X):

/anaconda/envs/skorch/lib/python3.5/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    323         for hook in self._forward_pre_hooks.values():
    324             hook(self, input)
--> 325         result = self.forward(*input, **kwargs)
    326         for hook in self._forward_hooks.values():
    327             hook_result = hook(self, input, result)

<ipython-input-10-7ef9aef0510e> in forward(self, sentence)
     19 
     20     def forward(self, sentence):
---> 21         embeds = self.word_embeddings(sentence)
     22         x = embeds.view(len(sentence), 1, -1)
     23         lstm_out, self.hidden = self.lstm(x, self.hidden)

/anaconda/envs/skorch/lib/python3.5/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    323         for hook in self._forward_pre_hooks.values():
    324             hook(self, input)
--> 325         result = self.forward(*input, **kwargs)
    326         for hook in self._forward_hooks.values():
    327             hook_result = hook(self, input, result)

/anaconda/envs/skorch/lib/python3.5/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    101             input, self.weight,
    102             padding_idx, self.max_norm, self.norm_type,
--> 103             self.scale_grad_by_freq, self.sparse
    104         )
    105 

/anaconda/envs/skorch/lib/python3.5/site-packages/torch/nn/_functions/thnn/sparse.py in forward(cls, ctx, indices, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
     57             output = torch.index_select(weight, 0, indices)
     58         else:
---> 59             output = torch.index_select(weight, 0, indices.view(-1))
     60             output = output.view(indices.size(0), indices.size(1), weight.size(1))
     61 

TypeError: torch.index_select received an invalid combination of arguments - got (torch.FloatTensor, int, torch.IntTensor), but expected (torch.FloatTensor source, int dim, torch.LongTensor index)

akzaidi avatar Dec 14 '17 01:12 akzaidi

I believe you are taking the wrong approach here. CountVectorizer is not a tokenizer; it creates row-wise word counts, whereas nn.Embedding expects word indices. Here is a short snippet that creates indices:

# requires: $ pip install dstoolbox
from dstoolbox.transformers import Padder2d

pipe = Pipeline([
    ('count', CountVectorizer(max_features=MAX_FEATURES)),
    ('to_index', FunctionTransformer(lambda X: [x.nonzero()[1] for x in X], accept_sparse=True)),
    ('pad', Padder2d(MAX_LEN, dtype=int)),
])
pipe.fit_transform(X)
# returns
array([[32,  0,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 5, 31, 22, 34, 39,  0,  0,  0,  0,  0],
       [15,  9,  0,  0,  0,  0,  0,  0,  0,  0],
       [46, 27, 44,  0,  0,  0,  0,  0,  0,  0],
       ...
       [ 4, 32,  0,  0,  0,  0,  0,  0,  0,  0]])
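
Regarding your "cast the index to long" question: Padder2d(MAX_LEN, dtype=int) already yields int64 arrays (on a 64-bit Linux build of numpy), and skorch converts int64 numpy arrays to torch.LongTensor, which is what nn.Embedding (and the underlying torch.index_select) expects, so the IntTensor error from above should not occur with this pipeline. You can check the dtype directly:

Xt = pipe.fit_transform(X)
print(Xt.dtype)  # int64 -> becomes a torch.LongTensor inside skorch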

You also need to remember that sklearn data is batch-first, whereas pytorch's RNN layers are not by default (they expect the sequence dimension first unless you pass batch_first=True). I hope that helps.
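
To illustrate both points, here is a minimal module sketch (placeholder names and sizes, not a drop-in fix) that consumes the padded (batch, seq_len) index arrays from the pipeline above; batch_first=True keeps the batch dimension first to match sklearn's layout, and the LSTM's hidden state is left to its default per-batch zero initialization:

import torch.nn as nn
import torch.nn.functional as F

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size=MAX_FEATURES, embed_dim=150, hidden_dim=10, n_classes=2):
        super(RNNClassifier, self).__init__()
        self.emb = nn.Embedding(vocab_size, embed_dim)
        # batch_first=True: inputs/outputs are (batch, seq_len, features),
        # matching the (batch, seq_len) index arrays coming out of sklearn
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, X):
        # X: LongTensor of word indices, shape (batch, seq_len)
        embedded = self.emb(X)              # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)   # h_n: (1, batch, hidden_dim)
        logits = self.out(h_n[-1])          # (batch, n_classes)
        return F.log_softmax(logits, dim=1) # log-probs for the default NLLLoss

rnn_net = NeuralNetClassifier(RNNClassifier, max_epochs=10, lr=0.02)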

benjamin-work avatar Dec 14 '17 09:12 benjamin-work

Hi @akzaidi, we just merged a notebook that shows how best to use skorch and RNNs for a classification task. I hope it helps with your task.

benjamin-work avatar Jan 10 '18 12:01 benjamin-work

Thank you! That worked great and I'm repurposing to try out CNNs and Q-RNNs for classification. Awesome work!

One quick question - is there an easy way to do multi-GPU hyperparameter search with GridSearchCV or RandomizedSearchCV? I tried setting the n_jobs argument to the number of GPU devices I have, but it still only ran sequentially.

Thanks!

akzaidi avatar Jan 16 '18 19:01 akzaidi

Glad to hear that you're making progress.

is there an easy way to implement multi-GPU hyperparameter search when using GridSearchCV or RandomizedSearchCV?

This is not possible yet. We have a ticket for that, but it won't be trivial: n_jobs in sklearn means CPU jobs, since it uses joblib under the hood. We will have to see how to achieve the same with multiple GPUs instead (it is also hard to test). If you find a solution or have an idea, feel free to suggest it there.

benjamin-work avatar Jan 17 '18 08:01 benjamin-work