Support for sparse arrays
Highly intrigued by this package! Thanks for releasing it :)
Does skorch have support for sparse arrays? For example, if I run the following pipeline:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('net', net),
])
pipe.fit(X_train, y_train)
where X_train is a numpy array of strings and y_train is a numpy array of integer labels, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-23-1d40af031c7c> in <module>()
----> 1 pipe.fit(X_train, y_train)
/anaconda/envs/skorch/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
257 Xt, fit_params = self._fit(X, y, **fit_params)
258 if self._final_estimator is not None:
--> 259 self._final_estimator.fit(Xt, y, **fit_params)
260 return self
261
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit(self, X, y, **fit_params)
1170 # this is actually a pylint bug:
1171 # https://github.com/PyCQA/pylint/issues/1085
-> 1172 return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
1173
1174
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit(self, X, y, **fit_params)
630 self.initialize()
631
--> 632 self.partial_fit(X, y, **fit_params)
633 return self
634
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in partial_fit(self, X, y, classes, **fit_params)
595 self.notify('on_train_begin')
596 try:
--> 597 self.fit_loop(X, y)
598 except KeyboardInterrupt:
599 pass
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit_loop(self, X, y, epochs)
521
522 if self.train_split:
--> 523 X_train, X_valid, y_train, y_valid = self.train_split(X, y)
524 dataset_valid = self.get_dataset(X_valid, y_valid)
525 else:
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/dataset.py in __call__(self, X, y)
308
309 # pylint: disable=invalid-name
--> 310 len_X = get_len(X)
311 if y is not None:
312 len_y = get_len(y)
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/dataset.py in get_len(data)
39
40 def get_len(data):
---> 41 lens = [_apply_to_data(data, len, unpack_dict=True)]
42 lens = list(flatten(lens))
43 len_set = set(lens)
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/dataset.py in _apply_to_data(data, func, unpack_dict)
35 except TypeError:
36 return func(data)
---> 37 return func(data)
38
39
/anaconda/envs/skorch/lib/python3.5/site-packages/scipy/sparse/base.py in __len__(self)
264 # non-zeros is more important. For now, raise an exception!
265 def __len__(self):
--> 266 raise TypeError("sparse matrix length is ambiguous; use getnnz()"
267 " or shape[0]")
268
TypeError: sparse matrix length is ambiguous; use getnnz() or shape[0]
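For context, the failure can be reproduced outside skorch entirely; a minimal sketch (not from the original report):

from scipy import sparse
import numpy as np

# len() on a scipy sparse matrix raises by design, which is exactly
# what skorch's get_len() runs into during the train/valid split above.
X = sparse.csr_matrix(np.eye(3))
print(X.shape[0])  # 3 -- the unambiguous row count
len(X)             # TypeError: sparse matrix length is ambiguous ...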
print(platform.platform())
print("Python:", sys.version)
print("PyTorch:", torch.__version__)
print("Skorch:", skorch.__version__)
print("Scikit-Learn:", sklearn.__version__)
Linux-4.4.0-91-generic-x86_64-with-debian-stretch-sid
Python: 3.5.3 |Anaconda custom (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
PyTorch: 0.3.0.post4
Skorch: 0.1.0.post1
Scikit-Learn: 0.19.0
Do I need to wrap the output of the transformer that precedes the NeuralNetClassifier in a FunctionTransformer and convert to dense arrays first?
Good catch, we haven't tried out sparse matrices yet. Unfortunately, they won't work, for several reasons. I tried to make them work on the skorch side, but then pytorch causes trouble (no indexing, no .size on sparse tensors).
As soon as pytorch's sparse tensors are better supported, we should make sure skorch works with them. Currently, it would require too many workarounds. Therefore, as you suggested, your best bet is to cast to a dense array:
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=MAX_FEATURES)),  # better set a limit
    ('tofloat32', FunctionTransformer(lambda x: x.astype(np.float32), accept_sparse=True)),
    ('toarray', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
    ('net', net),
])
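One design note on the snippet above: lambdas inside FunctionTransformer cannot be pickled, so if the pipeline later needs to be persisted or run with n_jobs > 1, module-level functions are the safer choice. A hedged sketch (names are illustrative, not part of the original answer):

import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Module-level functions pickle cleanly, unlike lambdas.
def to_float32(X):
    return X.astype(np.float32)

def to_dense(X):
    return X.toarray()

tofloat32 = FunctionTransformer(to_float32, accept_sparse=True)
toarray = FunctionTransformer(to_dense, accept_sparse=True)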
PS: I pushed some changes on the skorch side to make sparse matrices work: https://github.com/dnouri/skorch/tree/feature/sparse-support. As mentioned, it still won't work, though, because sparse tensors don't support some vital features (yet).
Ah okay, thanks for the info!
Sorry to bug you again!
This is confusing me:
class LSTMClassifier(nn.Module):
    def __init__(self, D_in=MAX_FEATURES, EMBED_SIZE=150, H=10, D_out=2):
        super(LSTMClassifier, self).__init__()
        self.hidden_dim = H
        self.word_embeddings = nn.Embedding(D_in, EMBED_SIZE)
        self.lstm = nn.LSTM(EMBED_SIZE, H)
        self.hidden2label = nn.Linear(H, D_out)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # the first is the hidden h
        # the second is the cell c
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        x = embeds.view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(x, self.hidden)
        y = self.hidden2label(lstm_out[-1])
        log_probs = F.log_softmax(y)
        return log_probs
lstmnet = NeuralNetClassifier(
    LSTMClassifier,
    max_epochs=100,
    lr=0.02,
    use_cuda=False,
    warm_start=False,
)
lstm = Pipeline([
    ('tokenizer', CountVectorizer(max_features=MAX_FEATURES)),
    ('toint', FunctionTransformer(lambda x: x.astype(np.int32), accept_sparse=True)),
    ('toarray', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
    ('net', lstmnet),
])

lstm.fit(X_train, y_train)
- Why does the module re-initialize despite warm_start=False?
- Do I need to cast the index to long? If so, how do I do so?
Re-initializing module!
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-31-4b32b563fee8> in <module>()
----> 1 lstm.fit(X_train, y_train)
/anaconda/envs/skorch/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
257 Xt, fit_params = self._fit(X, y, **fit_params)
258 if self._final_estimator is not None:
--> 259 self._final_estimator.fit(Xt, y, **fit_params)
260 return self
261
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit(self, X, y, **fit_params)
1170 # this is actually a pylint bug:
1171 # https://github.com/PyCQA/pylint/issues/1085
-> 1172 return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
1173
1174
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit(self, X, y, **fit_params)
630 self.initialize()
631
--> 632 self.partial_fit(X, y, **fit_params)
633 return self
634
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in partial_fit(self, X, y, classes, **fit_params)
595 self.notify('on_train_begin')
596 try:
--> 597 self.fit_loop(X, y)
598 except KeyboardInterrupt:
599 pass
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in fit_loop(self, X, y, epochs)
540 for Xi, yi in self.get_iterator(dataset_train, training=True):
541 self.notify('on_batch_begin', X=Xi, y=yi, training=True)
--> 542 loss = self.train_step(Xi, yi)
543 self.history.record_batch('train_loss', loss.data[0])
544 self.history.record_batch('train_batch_size', len(Xi))
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in train_step(self, Xi, yi)
462 self.module_.train()
463 self.optimizer_.zero_grad()
--> 464 y_pred = self.infer(Xi)
465 loss = self.get_loss(y_pred, yi, X=Xi, training=True)
466 loss.backward()
/anaconda/envs/skorch/lib/python3.5/site-packages/skorch/net.py in infer(self, x)
701 if isinstance(x, dict):
702 return self.module_(**x)
--> 703 return self.module_(x)
704
705 def predict_proba(self, X):
/anaconda/envs/skorch/lib/python3.5/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
323 for hook in self._forward_pre_hooks.values():
324 hook(self, input)
--> 325 result = self.forward(*input, **kwargs)
326 for hook in self._forward_hooks.values():
327 hook_result = hook(self, input, result)
<ipython-input-10-7ef9aef0510e> in forward(self, sentence)
19
20 def forward(self, sentence):
---> 21 embeds = self.word_embeddings(sentence)
22 x = embeds.view(len(sentence), 1, -1)
23 lstm_out, self.hidden = self.lstm(x, self.hidden)
/anaconda/envs/skorch/lib/python3.5/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
323 for hook in self._forward_pre_hooks.values():
324 hook(self, input)
--> 325 result = self.forward(*input, **kwargs)
326 for hook in self._forward_hooks.values():
327 hook_result = hook(self, input, result)
/anaconda/envs/skorch/lib/python3.5/site-packages/torch/nn/modules/sparse.py in forward(self, input)
101 input, self.weight,
102 padding_idx, self.max_norm, self.norm_type,
--> 103 self.scale_grad_by_freq, self.sparse
104 )
105
/anaconda/envs/skorch/lib/python3.5/site-packages/torch/nn/_functions/thnn/sparse.py in forward(cls, ctx, indices, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
57 output = torch.index_select(weight, 0, indices)
58 else:
---> 59 output = torch.index_select(weight, 0, indices.view(-1))
60 output = output.view(indices.size(0), indices.size(1), weight.size(1))
61
TypeError: torch.index_select received an invalid combination of arguments - got (torch.FloatTensor, int, torch.IntTensor), but expected (torch.FloatTensor source, int dim, torch.LongTensor index)
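The last frame boils down to torch.index_select rejecting an IntTensor index; a minimal sketch of that failure, outside the thread's code (the failing call is commented out):

import torch

# index_select requires a LongTensor index; the np.int32 cast in the
# pipeline above produces a torch.IntTensor, hence the TypeError.
weight = torch.randn(10, 4)
idx = torch.LongTensor([1, 2, 3])
print(torch.index_select(weight, 0, idx))    # works
# torch.index_select(weight, 0, idx.int())   # TypeError on this torch version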
I believe you are taking the wrong approach here. CountVectorizer is not a tokenizer; it creates row-wise word counts. nn.Embedding expects indices, though. Here is a short snippet that creates indices:
# requires: $ pip install dstoolbox
from dstoolbox.transformers import Padder2d

pipe = Pipeline([
    ('count', CountVectorizer(max_features=MAX_FEATURES)),
    ('to_index', FunctionTransformer(lambda X: [x.nonzero()[1] for x in X], accept_sparse=True)),
    ('pad', Padder2d(MAX_LEN, dtype=int)),
])
pipe.fit_transform(X)
# returns
array([[32, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 5, 31, 22, 34, 39, 0, 0, 0, 0, 0],
[15, 9, 0, 0, 0, 0, 0, 0, 0, 0],
[46, 27, 44, 0, 0, 0, 0, 0, 0, 0],
...
[ 4, 32, 0, 0, 0, 0, 0, 0, 0, 0]])
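One note on the long-cast question from earlier: nn.Embedding indexes with a torch.LongTensor, which skorch creates from int64 arrays (on 64-bit Linux, dtype=int in the snippet above already means int64). To make this explicit, a hedged variant of the same pipeline:

import numpy as np

# Sketch: request int64 padding explicitly so the batch reaches the
# module as a torch.LongTensor, which nn.Embedding requires.
pipe = Pipeline([
    ('count', CountVectorizer(max_features=MAX_FEATURES)),
    ('to_index', FunctionTransformer(lambda X: [x.nonzero()[1] for x in X], accept_sparse=True)),
    ('pad', Padder2d(MAX_LEN, dtype=np.int64)),
])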
You also need to remember that sklearn data is batch-first, whereas pytorch's RNN layers expect sequence-first input by default. I hope that helps.
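To illustrate the batch-first point, here is a minimal sketch of a module that accepts the (batch, seq_len) index arrays the pipeline produces; names and sizes are illustrative, not from the original thread:

import torch.nn as nn
import torch.nn.functional as F

class BatchFirstLSTM(nn.Module):
    # Sketch only: batch_first=True makes the LSTM accept input shaped
    # (batch, seq_len, embed), matching what sklearn pipelines deliver.
    def __init__(self, vocab_size=MAX_FEATURES, embed_size=150, hidden=10, n_classes=2):
        super(BatchFirstLSTM, self).__init__()
        self.emb = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, X):
        # X: LongTensor of shape (batch, seq_len)
        embedded = self.emb(X)             # (batch, seq_len, embed)
        _, (h_n, _) = self.lstm(embedded)  # h_n: (num_layers, batch, hidden)
        return F.log_softmax(self.out(h_n[-1]), dim=-1)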
hi @akzaidi , we just merged a notebook that shows how to best use skorch and RNNs for a classification task. I hope this helps with your task.
Thank you! That worked great and I'm repurposing to try out CNNs and Q-RNNs for classification. Awesome work!
One quick question: is there an easy way to implement multi-GPU hyperparameter search when using GridSearchCV or RandomizedSearchCV? I tried setting the n_jobs argument to the number of GPU devices I have, but the search still only ran sequentially.
Thanks!
Glad to hear that you're making progress.
is there an easy way to implement multi-GPU hyperparameter search when using GridSearchCV or RandomizedSearchCV?
This is not possible yet. We have a ticket for that, but it won't be trivial. n_jobs in sklearn means CPU jobs, since it uses joblib under the hood. We will have to see how to achieve this with multiple GPUs instead (it is also hard to test). If you find a solution or have an idea, feel free to suggest it there.
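For reference, a sketch of what n_jobs does today (illustrative only; lstmnet and the parameter grid are stand-ins):

from sklearn.model_selection import GridSearchCV

# joblib parallelizes the candidate fits across CPU processes; there is
# no per-worker GPU assignment, so every fit targets the same device.
params = {'lr': [0.01, 0.02], 'max_epochs': [10, 20]}
gs = GridSearchCV(lstmnet, params, n_jobs=4)
gs.fit(X_train, y_train)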