lstm
Optimized hyperparameters on "small" model, achieves 87 perplexity in 1 hour.
Applied Bayesian optimization (whetlab.com) to lower the perplexity of the "small" model (13 full epochs).
That's nice! How long did it take to run the Bayesian optimization and find the ideal hyperparameters?
It ran overnight. I'm still letting it run and it's still exploring, so this number could still get better.
That is really cool. Is this just on a single GPU, or are you running parallel jobs?
Running 10 g2.2xlarges in parallel. The job suggestions come in via a pull model over a REST API, so it is trivial to parallelize (no extra code required at all). 10 is on the small side of what we usually do when trying to break records.
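A minimal sketch of what each worker runs, assuming Whetlab's public Python client (`whetlab.Experiment`, `suggest`, `update`); attaching to a shared experiment by name and the `evaluate_job` helper are hypothetical stand-ins rather than the exact code used here:

```python
# Sketch of a single worker's loop; every machine runs the same script.
# Assumes Whetlab's Python client; evaluate_job() is a hypothetical stand-in
# for training the LSTM with the suggested hyperparameters.
import whetlab


def evaluate_job(job):
    """Hypothetical: train with the suggested hyperparameters, return the outcome."""
    raise NotImplementedError('replace with the actual Torch training run')


# Attach to the shared experiment by name (assumed client behavior).
experiment = whetlab.Experiment(name='lstm small model')

while True:
    job = experiment.suggest()      # pull the next suggested setting over the REST API
    result = evaluate_job(job)      # run it locally on this machine's GPU
    experiment.update(job, result)  # report back; no coordination between workers needed
```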
Is it possible to compare Bayesian optimization vs. a simple logistic regression or binary search? It is unclear how to quantify that. Also, can you tell us how much tuning of the Bayesian optimizer's own hyperparameters is required...
There's a bit of an explanation here: https://www.whetlab.com/technology/ The core engine is based on research you can read about here: http://arxiv.org/abs/1502.05700
Also, here is an earlier paper going into a bit more depth on the original approach: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms
The last link has comparisons to other hyperparameter optimization approaches, like the Tree-structured Parzen estimator and random/grid search, which we outperform by a large margin.
I'm not sure how you would use logistic regression or binary search in this setting, since (a) gradients aren't readily available for the hyperparameters and (b) some parameters are categorical, some integer, and some floating point. We handle all of these cases.
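To make that concrete, here is a hypothetical declaration of such a mixed search space; the field names are assumptions modeled on Whetlab's Python client, not the exact spec used for this PR:

```python
# Hypothetical mixed search space: categorical, integer, and float parameters
# together, with no gradients available for any of them.
parameters = {
    'optimizer':  {'type': 'enum',    'options': ['sgd', 'adagrad', 'rmsprop']},  # categorical
    'num_layers': {'type': 'integer', 'min': 1,   'max': 4},                      # integer
    'dropout':    {'type': 'float',   'min': 0.0, 'max': 0.7},                    # floating point
}
```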
Wow, that's impressive!
Maybe if you have a few processors free, you could try parameter-tuning the Adam optimizer on a test problem?
Really nice stuff, thanks @alexbw
The modifications required to automatically tune the hyperparameters are minimal. Pasting here in case anyone wants to replicate the tuning process. You'll need a beta key, which I'd be glad to provide you.
If you're interested, there are two small changes required:
- Define the hyperparameters to tune, and get a suggestion
- Report the performance of that suggestion (negative perplexity, or a NaN, if it failed).
We treat a return value of NaN (0/0) as a constraint violation, meaning we learn to avoid similar jobs that would produce a failure. For deep nets, this usually means a memory error or a segfault (which can occur, e.g., for weird batch sizes). In this example we also treat a training time too far above an hour as a constraint, because we want to train the best "fast" net. You could similarly train the best "small" net by flagging models that won't fit on a smartphone or won't provide real-time classification.
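Roughly, the two changes look like the Python sketch below. The `whetlab` calls (`Experiment`, `suggest`, `update`) follow Whetlab's public Python client, but the parameter names, ranges, and the `train_lstm` helper are hypothetical stand-ins rather than the exact code from this PR:

```python
# Sketch of the two changes: (1) define the search space and get a suggestion,
# (2) report the outcome (negative perplexity, or NaN to signal a constraint).
# Assumes Whetlab's Python client; train_lstm() is a hypothetical stand-in.
import whetlab


def train_lstm(rnn_size, dropout, learning_rate):
    """Hypothetical: train the 'small' model, return (validation perplexity, hours)."""
    raise NotImplementedError('replace with the actual Torch training run')


# (1) Define the hyperparameters to tune and the outcome to maximize.
parameters = {
    'rnn_size':      {'type': 'integer', 'min': 100, 'max': 400},
    'dropout':       {'type': 'float',   'min': 0.0, 'max': 0.7},
    'learning_rate': {'type': 'float',   'min': 0.1, 'max': 2.0},
}
outcome = {'name': 'negative validation perplexity'}

experiment = whetlab.Experiment(name='lstm small model',
                                description='Tune the "small" Penn Treebank LSTM',
                                parameters=parameters,
                                outcome=outcome)

job = experiment.suggest()  # dict of suggested hyperparameter values

try:
    perplexity, hours = train_lstm(**job)
    # Training much longer than an hour violates the "fast net" constraint.
    result = float('nan') if hours > 1.0 else -perplexity
except RuntimeError:        # e.g. out of memory for an unlucky setting
    result = float('nan')   # NaN teaches the optimizer to avoid similar jobs

# (2) Report the performance of that suggestion.
experiment.update(job, result)
```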
:+1: