dask-examples
initial skorch hyperparam opt implementation
Hey guys, I want to get the conversation started on this. I have a v1 implementation of an example using PyTorch + Skorch for a text classification problem. I'm then using Dask's Hyperband grid search algo to find the best hyperparameters. It ran successfully once and then I made some more changes and it's now failing with a fairly cryptic error message.
If you have some pointers, I can run edits. Meanwhile, I'll keep looking at it for potential bugs.
Separately, I'd love to get your thoughts on how to make better use of torchtext in the current pipeline. The way I'm preparing training data is causing a lot of extra compute and totally breaks the batching semantics of torchtext and deep learning models in general.
cc @stsievert
I fixed the typo you pointed out (thanks!) and noticed a couple other things along the way.
It turns out that I was running out of GPU memory because of the way I was passing in pretrained embeddings (I was using all of GloVe instead of a 25k subset of the vectors). So that's fixed now and memory utilization is staying much lower.
Another observation is that GPU memory utilization is monotonically increasing, and I haven't been able to reduce it by deleting PyTorch or Skorch objects, which I'd expect to garbage-collect those objects and free the memory. I'm wondering if this has something to do with working in a distributed environment, where deleting the object in the Jupyter Notebook doesn't delete references to GPU memory on the workers. When I keyboard-interrupted the process, the workers got restarted and memory utilization dropped to zero.
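A minimal sketch of one way to inspect and release GPU memory on the workers from the client side (the scheduler address below is just a placeholder); restarting the workers is essentially what the keyboard interrupt ended up doing:

```python
import torch
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# report PyTorch's allocated GPU memory (in bytes) on each worker
print(client.run(torch.cuda.memory_allocated))

# ask each worker to release PyTorch's cached GPU blocks
client.run(torch.cuda.empty_cache)

# heavier hammer: restart the workers, which drops all GPU references
client.restart()
```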
I'm also getting an error because my `filter_sizes` parameter is getting registered as a multi-dimensional array when stored in the `search.cv_results_` dictionary. I converted `search.cv_results_['param_module__filter_sizes']` to a list before passing it to pandas and that cleared it up, but I'm not sure if there's something better that can be done under the hood.
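For reference, the workaround is roughly the following (a sketch of the conversion described above):

```python
import pandas as pd

cv_results = dict(search.cv_results_)
# cast the per-candidate tuples to a plain Python list so pandas doesn't
# try to interpret the column as a multi-dimensional array
cv_results["param_module__filter_sizes"] = list(
    cv_results["param_module__filter_sizes"]
)
df = pd.DataFrame(cv_results)
```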
The script now runs but it takes a long time on a single small-ish GPU (~60 minutes). I'm hoping to try this out on a GPU cluster soon. I suspect that for big hyperparameter optimization jobs you'd want a fairly large cluster of GPUs (e.g. 4+) to get through them in a reasonable amount of time, which does put up a bit of a barrier to entry for an example demo and for any practitioners who can't afford that.
I could probably reduce the dataset size and the model would still converge. I'm just trying to create as "real" an example as possible.
Thanks for this use case. I've got some of the related fixes in https://github.com/dask/dask-ml/pull/671 (which will remain a draft until this PR is merged). Please comment in that PR with your questions and/or suggestions.
I'm just trying to create as "real" an example as possible.
I'd be careful with excessive computation. These examples run on Binder, which has pretty serious limits on computation. GPUs are definitely out of scope, and I hesitate to do any computation that takes more than ~10 seconds.
I've included cells like this before:
```python
# Make sure the computation isn't too excessive for this simple example.
max_iter = 9
# max_iter = 243  # uncomment this line for more realistic usage
```
reduce the dataset size and the model would still converge
I typically don't look for performance comparisons in examples like this. I tend to leave that for papers/documentation. Instead, I tend to run these examples to figure out how to use the tool. To me, the most salient questions this example answers are the following:
- How are Skorch models created from PyTorch models?
- How do I pass hyperparameters to Skorch models?
- How do I use a non-standard dataset with a Skorch model? What memory constraints does the dataset present, and how are they circumvented?
The last question might warrant another PR. The GPU usage is interesting and good to see. I'd definitely add a note saying this works on GPUs (and probably some code to put it on the GPU if available). Importantly, I don't think a GPU should be required.
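A minimal sketch of making the GPU optional (the module below is just a stand-in for the example's real PyTorch model):

```python
import torch
from torch import nn
from skorch import NeuralNetClassifier

# use a GPU when one is available, but don't require it
device = "cuda" if torch.cuda.is_available() else "cpu"


class TinyModule(nn.Module):  # stand-in for the example's real model
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 2)

    def forward(self, X):
        return nn.functional.log_softmax(self.linear(X), dim=-1)


net = NeuralNetClassifier(TinyModule, device=device)
```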
Thanks for all the feedback so far. The example is coming together. I implemented the `collate_fn` that we discussed above and things are now working well in Skorch (significantly faster when padding at the batch level).

The custom `DataLoader` worked well with Skorch and I'm currently trying to make it work with Hyperband. I think I found a way around the variable-length feature sizes: simply use the raw text as one single feature (i.e. "fixed size") in a Dask array and then let my collate function do the tokenization, padding, etc. This should seriously reduce the amount of computation performed (as it did in Skorch).
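Roughly, batch-level padding in a collate function looks like the following sketch (the tokenizer and vocab arguments stand in for the torchtext objects used in the notebook, and the function name is just illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_batch_sketch(batch, tokenizer, vocab):
    # batch is a list of (raw_text, label) pairs; tokenize and pad only to
    # the longest sequence in *this* batch rather than in the whole dataset
    texts, labels = zip(*batch)
    token_ids = [
        torch.tensor([vocab[token] for token in tokenizer(text)], dtype=torch.long)
        for text in texts
    ]
    X = pad_sequence(token_ids, batch_first=True, padding_value=0)
    y = torch.tensor(labels, dtype=torch.float32)
    return X, y
```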
The crux of the issue is that Hyperband doesn't appear to be handling the validation data correctly (though frankly, I can't tell if it's handling my training data correctly either; I'm not sure if this would be better in a .py where I might see more log output). It looks like Hyperband is passing my validation data through my `collate_fn` twice. I say this because Hyperband is erroring out with `KeyError: tensor([0.])`, and if you look closely at the traceback, the key at this particular stage of the `collate_fn` should be something like `pos` or `neg`, which are the unprocessed labels for my dataset. The output of `collate_fn` should be a processed label of the form `tensor([0.])`, hence why I think `collate_fn` is being called twice on my validation data. Finally, I think this is a validation data issue (as opposed to a training data issue) because it only seems to arise when `return accuracy_score(y, self.predict(X), sample_weight=sample_weight)` is called. I left the error in the notebook I most recently pushed.
Do you have any thoughts on why the handling of the validation data in Hyperband would differ from the training data? Separately, is there any reason to think that a `DataLoader` wouldn't work here?
EDIT: do you think this has anything to do with skorch-dev/skorch#641?
Sorry for the delayed response. I'll have more time to respond a week from now.
Hyperband doesn't appear to be handling the validation data correctly
Could the issue be with out-of-vocabulary words? There might be a word in the validation set that's not in the training set. That's the first idea that comes to mind, especially because you're passing in a list of strings. If that's the issue, using HashingVectorizer would resolve it.
I can't tell if it's handling my training data correctly either
From what I can tell, you're handling the test data correctly: it appears you're only running the test data through the model once at the very end. How are you confused?
Do you have any thoughts on why the handling of the validation data in Hyperband would differ from the training data?
Yes, especially with text data. Do the train, validation and test sets all have the same vocabulary? If not, you could probably get around it with something like:
```python
def pad_batch(batch, TEXT, LABEL):
    text, label = list(zip(*batch))
    text = [word for word in text if word in TRAIN_VOCAB]
    # ... rest of function untouched
```
for some appropriately defined `TRAIN_VOCAB`. Alternatively, you could use `HashingVectorizer` as mentioned in https://github.com/dask/dask-examples/pull/149#discussion_r429566625.
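For completeness, a minimal sketch of that HashingVectorizer route (with toy stand-ins for the real data); hashing is stateless, so validation-set words that never appeared in the training set can't raise a KeyError:

```python
from sklearn.feature_extraction.text import HashingVectorizer

train_texts = ["a great movie", "a terrible movie"]  # toy stand-ins
val_texts = ["an unseen word appears here"]

vectorizer = HashingVectorizer(n_features=2 ** 18)
X_train = vectorizer.transform(train_texts)
X_val = vectorizer.transform(val_texts)  # no vocabulary, so no KeyError
```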
I suspect this is coming into play with these lines:
```python
train_dataloader = DataLoader(..., collate_fn=pad_batch_partial)  # In[20]
test_dataloader = DataLoader(..., collate_fn=pad_batch_partial)   # In[32]
```
I wouldn't use a grid search with Hyperband. I prefer random searches.
@stsievert, my apologies for the delay in picking this back up. The good news: we're up and running!
After working with Skorch a bit more, it became quite obvious to me why this wasn't going to work as it was written before. In short, in Skorch I was passing `skorch_model.fit(train_dataset, y=None)`, while in `HyperbandSearchCV` I was passing `search.fit(X, y)`. In Skorch, I am using `torch.utils.data.Dataset` and `torch.utils.data.DataLoader`, which aren't compatible with `HyperbandSearchCV` for two reasons: 1) `X` in `skorch_model.fit(train_dataset, y=None)` contains both the features and the label in one tuple, while in `HyperbandSearchCV` I have to pass `X` and `y` explicitly, and 2) `torch.utils.data.DataLoader` (with a `collate_fn`) doesn't accept dask arrays.
But hindsight is 20/20.
Most problems appear to stem from the need to use dask arrays.
In Skorch, you have full control over how your data is formatted when fed to `model.fit()` and full control over how data is preprocessed (e.g. tokenized, padded, etc.) before it is sent to the network. This flexibility is curtailed when you are required to use dask arrays AND prepare all of your data ahead of time.
Here are two options (as I see it) for using `HyperbandSearchCV` with variable-length features:
- You can put raw text in a dask array (i.e. a "fixed-length feature" since it's a singular string). Then you need some sort of preprocessing to occur. That could happen in the model, but it's not good practice (see the sketch after this list).
  - `torch.nn.utils.rnn.pad_sequence` might be useful here, but you'd still need a tokenizer in the model and that's just not a great design pattern.
- You can preprocess text outside of the model, but then you must have a fixed-size feature set (i.e. pad to the longest sequence in your entire dataset), and this is what drives all the extra compute time.
  - You can fix your sequence length, which I've done, to reduce computation times, but accuracy does suffer. You're also still running a lot of extra computation for shorter sequences that were heavily padded.
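To make the first option concrete, the dask array would hold the raw strings themselves; a sketch with toy data:

```python
import numpy as np
import dask.array as da

# toy stand-ins for the real reviews and labels
texts = np.array(["a great movie", "a terrible movie", "not bad at all"], dtype=object)
labels = np.array([1.0, 0.0, 1.0], dtype="float32")

# each sample is a single string, so the array has a fixed shape even though
# the underlying texts vary in length; tokenization/padding happens downstream
X = da.from_array(texts.reshape(-1, 1), chunks=(2, 1))
y = da.from_array(labels, chunks=2)
```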
Here's how I'm thinking about the decision to use or not use `HyperbandSearchCV`. If your grid search is large AND Hyperband's algorithm is asymptotically faster than something like `RandomizedSearchCV` (e.g. O(n^2) vs O(n^3)), then it makes sense to use `HyperbandSearchCV`. The question is, where is that tipping point in terms of the size of the parameter search? In my experiments, I've witnessed a 2-3x speedup by using proper deep learning batching semantics (see here). In other words, every batch that is fed through the network in Dask takes 2-3x longer than in vanilla PyTorch or Skorch. If you can overcome this slowdown by simply doing significantly less computation (i.e. less grid searching), then you come out ahead by using `HyperbandSearchCV`. If your search space isn't arbitrarily large, then you would likely come out ahead by using Skorch + `RandomizedSearchCV`, simply as a result of implementation details (i.e. not due to the asymptotic performance of `RandomizedSearchCV` or `HyperbandSearchCV`).
This analysis doesn’t rigorously address the time it takes for a model to converge under standard batching semantics (i.e. pad at the batch level) vs. padding to the longest example in the dataset. The model may take longer to converge and/or may not be able to achieve peak performance (e.g. accuracy, f1, etc.) by padding to the longest example in the dataset.
Sadly, I don't know enough about Dask's or `HyperbandSearchCV`'s internals to provide a well-reasoned recommendation. One possibility would be to consider how one might use `HyperbandSearchCV` without dask arrays and instead use numpy arrays or torch tensors. Another thing to consider is the possibility of doing some preprocessing (e.g. tokenization, etc.) after data is submitted to `search.fit()` but before data makes it to the network.
I’m happy to discuss further and answer any questions that you have.
I’m also happy to run any edits on this example to polish it up but in terms of design patterns, I think we’ve explored most of the obvious options.
I have one more basic question:
- Why does model convergence depend on padding at the batch level vs. padding to the longest example in the dataset? Is "convergence" in terms of optimization iterations?
I'd expect the model to converge at the same rate regardless of batching semantics. I'd expect the model to have identical output for identical inputs, regardless of whether the input is padded to the longest example in the batch or in the dataset. Why isn't that the case, or am I misunderstanding?
You can put raw text in a dask array (i.e. a “fixed-length feature” since it’s a singular string). Then you need some sort of preprocessing to occur. That could happen in the model but it’s not good practice.
I like this solution best because arrays of strings are passed between workers. Why is this implementation bad practice?
```python
import numpy as np
import skorch
import torch


class PreprocessInputs(skorch.NeuralNetClassifier):
    def __init__(self, preprocessing, **kwargs):
        self.preprocessing = preprocessing
        super().__init__(**kwargs)

    def partial_fit(self, X: np.ndarray, y=None):
        # run the (non-trainable) preprocessing before handing data to skorch
        X_processed = self.preprocess(torch.from_numpy(X))
        return super().partial_fit(X_processed, y=y)

    def preprocess(self, X):
        with torch.no_grad():
            return self.preprocessing(X)
```
This implementation is more usable because the model is one atomic unit. That is, no outside knowledge is needed on specific methods to preprocess or normalize the input. We could fold this implementation into Dask-ML, but it'd basically be doing the same thing as this implementation.
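A hypothetical usage sketch, with toy stand-ins for the preprocessing callable and the PyTorch module (in the example these would be the real tokenization step and model):

```python
import numpy as np
import torch
from torch import nn


def numericalize(X):
    # placeholder preprocessing; the real example would tokenize/pad here
    return X.float()


class ToyModule(nn.Module):  # placeholder for the example's PyTorch model
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 2)

    def forward(self, X):
        return nn.functional.log_softmax(self.linear(X), dim=-1)


net = PreprocessInputs(preprocessing=numericalize, module=ToyModule, max_epochs=1)
net.partial_fit(np.random.randint(0, 5, size=(20, 10)), y=np.array([0, 1] * 10))
```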
- Why does model convergence depend on padding at the batch level vs. padding to the longest example in the dataset? Is "convergence" in terms of optimization iterations?
It's a bit hand wavy but I've observed very different loss metrics when training on data padded at the batch level (lower loss scores) vs. data padded at the dataset level (higher loss scores). Yes, when I say convergence, I mean that the model is actually training effectively and improving with each iteration. I'd have to spend some more time running experiments to determine if a model trained on data padded to the longest example in the dataset would be able to achieve the same level of performance (e.g. accuracy, f1, etc.) as one trained with batch level padding.
I'd expect the model to converge at the same rate regardless of batching semantics. I'd expect the model to have identical output for identical inputs, regardless of whether the input is padded to the longest example in the batch or in the dataset. Why isn't that the case, or am I misunderstanding?
Suppose you've trained a classification deep learning model. Further, let's suppose you prepare one single example that you want a prediction for. If you pad that example, you will get one predicted probability distribution. If you don't pad that example, you will get a different probability distribution. Without more experimentation, the question still remains, how big is that difference?
I agree with you, padding should not play a huge role, in general, especially when you might only see a dozen or so pad tokens in a typical batch. However, in our case, we're talking about 1000s of unnecessary pad tokens. It probably will have some sort of impact on the predicted probability distributions.
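As a quick sanity check of that intuition, here's a toy sketch (the embedding + mean-pool model is just a stand-in for the real classifier):

```python
import torch
from torch import nn
import torch.nn.functional as F

# toy stand-in for a trained classifier: embed, mean-pool, classify
emb = nn.Embedding(1000, 8, padding_idx=0)
clf = nn.Linear(8, 2)

x = torch.tensor([[12, 7, 256]])           # a short tokenized example
x_padded = F.pad(x, (0, 1000), value=0)    # the same example plus 1000 pad tokens

with torch.no_grad():
    p_short = clf(emb(x).mean(dim=1)).softmax(dim=-1)
    p_padded = clf(emb(x_padded).mean(dim=1)).softmax(dim=-1)

# mean-pooling over the pad embeddings dilutes the signal, so the two
# predicted distributions differ even though the underlying example is identical
print(p_short, p_padded)
```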
I like this solution best because arrays of string are passed between workers. Why is this implementation bad practice? This implementation is more usable because the model is one atomic unit. That is, no outside knowledge is needed on specific methods to preprocess or normalize the input. We could fold this implementation into Dask-ML, but it'd basically be doing the same thing as this implementation.
There's nothing inherently wrong about it. It's just not the typical design pattern that you'll see in the wild, where people tend to experiment with model architectures much more frequently than they experiment with their preprocessing setup. For example, torchtext is decoupled from modeling, and Hugging Face follows this same pattern.
I implemented parts of that approach in a previous commit. We can pursue it further - I just worry about adoption of this pattern.
It's been a while since there was any activity here. @ToddMorrill, is there still any interest in getting this into a mergeable state?