lstm
Optimized hyperparameters on "small" model, achieves 87 perplexity in 1 hour.
Applied Bayesian optimization (whetlab.com) to lower the perplexity of the "small" model (13 full epochs).
That's nice! How long did it take to run the Bayesian optimization and find the ideal hyperparameters?
It ran overnight. I'm still letting it run and it's still exploring, so this number could still get better.
That is really cool. Is this just on a single GPU, or are you running parallel jobs?
Running 10 g2.2xlarges in parallel. The job suggestions come in via a pull model over a REST API, so it is trivial to parallelize (no extra code required at all). 10 is on the small side of what we usually do when trying to break records.
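A minimal sketch of what each worker runs, assuming Whetlab's public Python client (`whetlab.Experiment`, `suggest`, `update`); attaching to a shared experiment by name and the `evaluate_job` helper are hypothetical stand-ins rather than the exact code used here:

```python
# Sketch of a single worker's loop; every machine runs the same script.
# Assumes Whetlab's Python client; evaluate_job() is a hypothetical stand-in
# for training the LSTM with the suggested hyperparameters.
import whetlab


def evaluate_job(job):
    """Hypothetical: train with the suggested hyperparameters, return the outcome."""
    raise NotImplementedError('replace with the actual Torch training run')


# Attach to the shared experiment by name (assumed client behavior).
experiment = whetlab.Experiment(name='lstm small model')

while True:
    job = experiment.suggest()      # pull the next suggested setting over the REST API
    result = evaluate_job(job)      # run it locally on this machine's GPU
    experiment.update(job, result)  # report back; no coordination between workers needed
```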
Is it possible to compare Bayesian optimization vs. a simple logistic regression or binary search? It is unclear how to quantify that. Also, can you tell us how much tuning of the Bayesian optimizer's own hyperparameters is required...
There's a bit of an explanation here: https://www.whetlab.com/technology/ The core engine is based on research you can read about here: http://arxiv.org/abs/1502.05700
Also, here is an earlier paper going into a bit more depth on the original approach: http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms
The last link has comparisons to other hyperparameter optimization approaches, like the Tree-structured Parzen estimator and random/grid search, which we outperform by a large margin.
I'm not sure how you would use logistic regression or binary search in this setting, since (a) gradients aren't readily available for the hyperparameters and (b) some parameters are categorical, some integer, and some floating point. We handle all of these cases.
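To make that concrete, here is a hypothetical declaration of such a mixed search space; the field names are assumptions modeled on Whetlab's Python client, not the exact spec used for this PR:

```python
# Hypothetical mixed search space: categorical, integer, and float parameters
# together, with no gradients available for any of them.
parameters = {
    'optimizer':  {'type': 'enum',    'options': ['sgd', 'adagrad', 'rmsprop']},  # categorical
    'num_layers': {'type': 'integer', 'min': 1,   'max': 4},                      # integer
    'dropout':    {'type': 'float',   'min': 0.0, 'max': 0.7},                    # floating point
}
```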
Wow, that's impressive!
Maybe if you have a few processors free, you could try parameter-tuning the Adam optimizer on a test problem?
Really nice stuff, thanks @alexbw
The modifications required to automatically tune the hyperparameters are minimal. Pasting here in case anyone wants to replicate the tuning process. You'll need a beta key, which I'd be glad to provide you.
If you're interested, there are two small changes required:
- Define the hyperparameters to tune, and get a suggestion
- Report the performance of that suggestion (negative perplexity, or a NaN, if it failed).
We treat a return value of NaN (0/0) as a constraint violation, meaning we learn to avoid similar jobs that would produce a failure. For deep nets, this usually means a memory error or a segfault (which can occur, e.g., for weird batch sizes). In this example we also treat a training time too far above an hour as a constraint, because we want to train the best "fast" net. You could similarly train the best "small" net by flagging models that won't fit on a smartphone or won't provide real-time classification.
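Roughly, the two changes look like the Python sketch below. The `whetlab` calls (`Experiment`, `suggest`, `update`) follow Whetlab's public Python client, but the parameter names, ranges, and the `train_lstm` helper are hypothetical stand-ins rather than the exact code from this PR:

```python
# Sketch of the two changes: (1) define the search space and get a suggestion,
# (2) report the outcome (negative perplexity, or NaN to signal a constraint).
# Assumes Whetlab's Python client; train_lstm() is a hypothetical stand-in.
import whetlab


def train_lstm(rnn_size, dropout, learning_rate):
    """Hypothetical: train the 'small' model, return (validation perplexity, hours)."""
    raise NotImplementedError('replace with the actual Torch training run')


# (1) Define the hyperparameters to tune and the outcome to maximize.
parameters = {
    'rnn_size':      {'type': 'integer', 'min': 100, 'max': 400},
    'dropout':       {'type': 'float',   'min': 0.0, 'max': 0.7},
    'learning_rate': {'type': 'float',   'min': 0.1, 'max': 2.0},
}
outcome = {'name': 'negative validation perplexity'}

experiment = whetlab.Experiment(name='lstm small model',
                                description='Tune the "small" Penn Treebank LSTM',
                                parameters=parameters,
                                outcome=outcome)

job = experiment.suggest()  # dict of suggested hyperparameter values

try:
    perplexity, hours = train_lstm(**job)
    # Training much longer than an hour violates the "fast net" constraint.
    result = float('nan') if hours > 1.0 else -perplexity
except RuntimeError:        # e.g. out of memory for an unlucky setting
    result = float('nan')   # NaN teaches the optimizer to avoid similar jobs

# (2) Report the performance of that suggestion.
experiment.update(job, result)
```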
:+1: