
optimizers wishlist

Open skeydan opened this issue 5 years ago • 18 comments

Adam (unsurprisingly I guess)

thanks!

skeydan avatar Jun 30 '20 08:06 skeydan

Adding LBFGS just out of enthusiasm ;-) (not using it anywhere yet, but I suspect it could be really good for some types of data)

skeydan avatar Jul 01 '20 14:07 skeydan

Can I add LBFGS or is anyone already working on this?

dirkschumacher avatar Oct 24 '20 19:10 dirkschumacher

Why don't you use the C++ implementations of the optimizers?

dirkschumacher avatar Oct 24 '20 19:10 dirkschumacher

I am not working on LBFGS. Pinging @krzjoa as he worked on many other implementations.

The main reason for not using the C++ implementations was that I wanted to allow extensions on the R side, and it would be tricky to call custom R functions when using the C++ class. Other reasons are:

  • many of them are simple enough to be reimplemented.
  • the C++ implementation doesn't support parameter groups.

That said, we can have optimizers that instantiate the C++ object when initializing and then have the step method call the C++ step method.
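
For illustration, here is a minimal sketch of what the R-side extensibility looks like with torch's optimizer() helper. The name optim_plain_sgd and the hyperparameters are made up for the example; this is just the kind of extension that would be hard to do against the C++ class:

```r
library(torch)

# Illustrative only: a plain SGD re-implemented on the R side via optimizer().
optim_plain_sgd <- optimizer(
  "optim_plain_sgd",
  initialize = function(params, lr = 0.01) {
    defaults <- list(lr = lr)
    super$initialize(params, defaults)
  },
  step = function() {
    with_no_grad({
      for (group in self$param_groups) {   # parameter groups come for free
        for (param in group$params) {
          if (is.null(param$grad)) next    # skip params without gradients
          param$add_(param$grad, alpha = -group$lr)
        }
      }
    })
  }
)

# usage: same interface as the built-in optimizers
w <- torch_randn(5, requires_grad = TRUE)
opt <- optim_plain_sgd(list(w), lr = 0.1)
loss <- (w^2)$sum()
loss$backward()
opt$step()
opt$zero_grad()
```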

dfalbel avatar Oct 24 '20 20:10 dfalbel

I haven't started working on LBFGS yet, so feel free to implement it, @dirkschumacher :wink:

krzjoa avatar Oct 24 '20 20:10 krzjoa

OK, given the complexity of LBFGS I would rather use an existing implementation than rewrite it in R. I will take a look and come back with questions :)

dirkschumacher avatar Oct 24 '20 20:10 dirkschumacher

Here I would like to propose an initial roadmap for torch optimizers. In my opinion, our first goal could be to implement all optimizers that are currently present in PyTorch and Keras.

PyTorch optimizers:

  • [X] Adadelta
  • [X] Adagrad
  • [X] Adam
  • [ ] AdamW
  • [ ] SparseAdam
  • [ ] Adamax
  • [X] ASGD
  • [X] LBFGS
  • [X] RMSprop
  • [X] Rprop
  • [X] SGD

Keras optimizers (not present in PyTorch):

  • [ ] Nadam
  • [ ] Ftrl

Additionally, there are a couple of MXNet optimizers that don't appear above.

Some other fancier optimizers could be implemented in a separate library like pytorch-optimizer.

krzjoa avatar Oct 28 '20 21:10 krzjoa

Maybe it would make sense to extract the optimizers into a different package anyway and re-export them in the torch package.

dirkschumacher avatar Oct 29 '20 16:10 dirkschumacher

I understand your intention, but personally I would not recommend that (i.e. extracting and re-exporting the whole optim module). Optimizers cannot work without torch, so we'd create a cyclic graph of dependencies. 😉

krzjoa avatar Oct 29 '20 17:10 krzjoa

Yeah, I agree. Cyclic dependencies are considered harmful :)

dirkschumacher avatar Oct 29 '20 19:10 dirkschumacher

Sounds good to me! +1 for a new package for fancier optimizers! That would be really cool.

dfalbel avatar Oct 29 '20 22:10 dfalbel

Hi @dfalbel ,

> The main reason for not using the C++ implementations was that I wanted to allow extensions on the R side, and it would be tricky to call custom R functions when using the C++ class. Other reasons are:
>
>   • many of them are simple enough to be reimplemented.
>   • the C++ implementation doesn't support parameter groups.
>
> That said, we can have optimizers that instantiate the C++ object when initializing and then have the step method call the C++ step method.

Would it be possible to have optimizers that use the C++ objects? For example Adam or AdamW, which are both available in LibTorch. I believe this could speed up my training quite a bit, since it seems that optimizer$step() is a bottleneck compared to PyTorch. Also, if I'm reading the C++ docs correctly, it seems the C++ API now supports parameter groups.
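
For what it's worth, a rough CPU-only sketch of how one could check where the time goes (illustrative only; on GPU the ops are asynchronous, so naive timing like this would be misleading):

```r
library(torch)

# Illustrative: measure how much of the training time optimizer$step() takes.
model <- nn_linear(256, 256)
opt   <- optim_adam(model$parameters, lr = 1e-3)
x <- torch_randn(512, 256)
y <- torch_randn(512, 256)

step_secs <- 0
total_secs <- system.time({
  for (i in 1:200) {
    opt$zero_grad()
    loss <- nnf_mse_loss(model(x), y)
    loss$backward()
    step_secs <- step_secs + system.time(opt$step())[["elapsed"]]
  }
})[["elapsed"]]

cat(sprintf("total: %.2fs, optimizer$step(): %.2fs\n", total_secs, step_secs))
```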

egillax avatar Jun 09 '22 10:06 egillax

Yes, in theory that's possible. I'll draft something in that direction and post it here.

Ideally I'd really like to figure out what's slowing down optimizers in R compared to PyTorch, because being able to inherit from other optimizers and so on is really useful for research and verification. Still, I don't see what could be causing this; maybe the way we keep state? Anyway, I think having those C++-based optimizers is good.
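
To make the idea concrete, a wrapper along these lines might look roughly as follows. Note that cpp_adam_new() and cpp_adam_step() are hypothetical placeholder names for the LibTorch-level bindings, not functions that exist in torch today:

```r
library(torch)

# Sketch of the "instantiate the C++ optimizer, delegate step()" pattern.
# cpp_adam_new() / cpp_adam_step() are hypothetical bindings used purely for
# illustration; they are not part of the current torch API.
optim_adam_cpp <- optimizer(
  "optim_adam_cpp",
  initialize = function(params, lr = 1e-3) {
    defaults <- list(lr = lr)
    super$initialize(params, defaults)
    # hand the parameter tensors to LibTorch's Adam once, up front
    self$ptr <- cpp_adam_new(self$param_groups[[1]]$params, lr)
  },
  step = function(closure = NULL) {
    # the actual update happens entirely in C++
    cpp_adam_step(self$ptr)
    invisible(NULL)
  }
)
```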

dfalbel avatar Jun 09 '22 14:06 dfalbel

@egillax Here's a POC binding directly to the C++ LibTorch optimizers:

https://github.com/dfalbel/torchoptx

I haven't benchmarked it at all, but I'm curious to see whether it's much faster than the R-based optimizers. Currently only SGD and Adam are supported, but it shouldn't be a lot of work to add the others. We would also need to figure out how to support serialization.
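
Assuming torchoptx keeps the same constructor interface as torch::optim_adam() (which is what a drop-in replacement would imply; the actual exported names may differ, so check the repo), swapping it in would look like:

```r
library(torch)

model <- nn_linear(10, 1)

# R implementation shipped with torch
opt <- optim_adam(model$parameters, lr = 1e-3)

# hypothetical drop-in from the POC package, assuming it mirrors the same
# interface; see the torchoptx README for the actual constructor names
# opt <- torchoptx::optim_adam(model$parameters, lr = 1e-3)

# the training loop is identical either way
x <- torch_randn(32, 10)
y <- torch_randn(32, 1)
opt$zero_grad()
loss <- nnf_mse_loss(model(x), y)
loss$backward()
opt$step()
```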

dfalbel avatar Jun 09 '22 18:06 dfalbel

Hi @dfalbel,

That was quick! But I can't open the repo; is it by any chance private?

egillax avatar Jun 09 '22 19:06 egillax

Ohh sorry! Just made it public.

dfalbel avatar Jun 09 '22 19:06 dfalbel

@dfalbel Some preliminary results for a small ResNet run for 20 epochs on the same random data:

With the torch R Adam optimizer:

Average time per epoch was: 0.854 secs

With the torchoptx C++ Adam:

Average time per epoch was: 0.513 secs

And with PyTorch:

Average time per epoch: 0.403 seconds

So it's quite a bit faster than before and closer to PyTorch! And it's a drop-in replacement.

I'll test it more tomorrow; I have a Transformer model that showed larger differences between PyTorch and torch in R. I'll also post the code I use for testing if anyone is curious.
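
In the meantime, here is a minimal sketch of the kind of timing loop behind per-epoch averages like the ones above (illustrative; not the actual ResNet benchmark):

```r
library(torch)

# Minimal per-epoch timing sketch on random data.
model <- nn_linear(128, 10)
opt   <- optim_adam(model$parameters, lr = 1e-3)
x <- torch_randn(2048, 128)
y <- torch_randn(2048, 10)

n_epochs   <- 20
batch_size <- 128
n_batches  <- 2048 %/% batch_size
epoch_secs <- numeric(n_epochs)

for (epoch in seq_len(n_epochs)) {
  t0 <- Sys.time()
  for (b in seq_len(n_batches)) {
    idx <- ((b - 1) * batch_size + 1):(b * batch_size)
    opt$zero_grad()
    loss <- nnf_mse_loss(model(x[idx, ]), y[idx, ])
    loss$backward()
    opt$step()
  }
  epoch_secs[epoch] <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
}

cat(sprintf("Average time per epoch was: %.3f secs\n", mean(epoch_secs)))
```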

egillax avatar Jun 09 '22 20:06 egillax

@egillax Nice, this sounds promising! I think we should be able to achieve something very similar from R, especially on GPU, where the operations are non-blocking.

This will probably make more of a difference the more parameters the network has, though.

dfalbel avatar Jun 09 '22 22:06 dfalbel