
optimizers wishlist

Open skeydan opened this issue 5 years ago • 18 comments

Adam (unsurprisingly I guess)

thanks!

skeydan avatar Jun 30 '20 08:06 skeydan

Adding LBFGS just out of enthusiasm ;-) (not using it anywhere yet, but I suspect it could be really good for some types of data)

skeydan avatar Jul 01 '20 14:07 skeydan

Can I add LBFGS or is anyone already working on this?

dirkschumacher avatar Oct 24 '20 19:10 dirkschumacher

Why don't you use the C++ implementations of the optimizers?

dirkschumacher avatar Oct 24 '20 19:10 dirkschumacher

I am not working on LBFGS. Pinging @krzjoa as he worked on many other implementations.

The main reason for not using the C++ implementations was that I wanted to allow extensions on the R side, and it would be tricky to call custom R functions when using the C++ class. Other reasons are:

  • many of them are simple enough to be reimplemented.
  • the C++ implementation doesn't support parameter groups.

That said, we can have optimizers that instantiate the C++ object when initializing and then have the step method call the C++ step method.
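
For illustration, here is a minimal sketch of what the R-side extensibility looks like with torch's optimizer() helper. The name optim_plain_sgd and the hyperparameters are made up for the example; this is just the kind of extension that would be hard to do against the C++ class:

```r
library(torch)

# Illustrative only: a plain SGD re-implemented on the R side via optimizer().
optim_plain_sgd <- optimizer(
  "optim_plain_sgd",
  initialize = function(params, lr = 0.01) {
    defaults <- list(lr = lr)
    super$initialize(params, defaults)
  },
  step = function() {
    with_no_grad({
      for (group in self$param_groups) {   # parameter groups come for free
        for (param in group$params) {
          if (is.null(param$grad)) next    # skip params without gradients
          param$add_(param$grad, alpha = -group$lr)
        }
      }
    })
  }
)

# usage: same interface as the built-in optimizers
w <- torch_randn(5, requires_grad = TRUE)
opt <- optim_plain_sgd(list(w), lr = 0.1)
loss <- (w^2)$sum()
loss$backward()
opt$step()
opt$zero_grad()
```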

dfalbel avatar Oct 24 '20 20:10 dfalbel

I haven't started working on LBFGS yet, so feel free to implement it, @dirkschumacher :wink:

krzjoa avatar Oct 24 '20 20:10 krzjoa

OK, given the complexity of LBFGS I would rather use an existing implementation than rewrite it in R. I will take a look and come back with questions :)

dirkschumacher avatar Oct 24 '20 20:10 dirkschumacher

Here I would like to propose an initial roadmap for torch optimizers. In my opinion, our first goal could be to implement all optimizers that are currently present in PyTorch and Keras.

PyTorch optimizers:

  • [X] Adadelta
  • [X] Adagrad
  • [X] Adam
  • [ ] AdamW
  • [ ] SparseAdam
  • [ ] Adamax
  • [X] ASGD
  • [X] LBFGS
  • [X] RMSprop
  • [X] Rprop
  • [X] SGD

Keras optimizers (not present in PyTorch):

  • [ ] Nadam
  • [ ] Ftrl

Additionally, there are a couple of MXNet optimizers that don't appear above.

Some other fancier optimizers could be implemented in a separate library like pytorch-optimizer.

krzjoa avatar Oct 28 '20 21:10 krzjoa

Maybe it would make sense to extract the optimizers into a different package anyway and re-export them in the torch package.

dirkschumacher avatar Oct 29 '20 16:10 dirkschumacher

I understand your intention, but personally I would not recommend that (i.e. extracting and re-exporting the whole optim module). Optimizers cannot work without torch, so we'd create a cyclic graph of dependencies. 😉

krzjoa avatar Oct 29 '20 17:10 krzjoa

Yeah, I agree. Cyclic dependencies are considered harmful :)

dirkschumacher avatar Oct 29 '20 19:10 dirkschumacher

Sounds good to me! +1 for a new package for fancier optimizers! That would be really cool.

dfalbel avatar Oct 29 '20 22:10 dfalbel

Hi @dfalbel ,

> The main reason for not using the C++ implementations was that I wanted to allow extensions on the R side, and it would be tricky to call custom R functions when using the C++ class. Other reasons are:
>
>   • many of them are simple enough to be reimplemented.
>   • the C++ implementation doesn't support parameter groups.
>
> That said, we can have optimizers that instantiate the C++ object when initializing and then have the step method call the C++ step method.

Would it be possible to have optimizers that use the C++ objects? For example Adam or AdamW, which are both available in LibTorch. I believe this could speed up my training quite a bit, since it seems that optimizer$step() is a bottleneck compared to PyTorch. Also, if I'm reading the C++ docs correctly, it seems the C++ API now supports parameter groups.
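
For what it's worth, a rough CPU-only sketch of how one could check where the time goes (illustrative only; on GPU the ops are asynchronous, so naive timing like this would be misleading):

```r
library(torch)

# Illustrative: measure how much of the training time optimizer$step() takes.
model <- nn_linear(256, 256)
opt   <- optim_adam(model$parameters, lr = 1e-3)
x <- torch_randn(512, 256)
y <- torch_randn(512, 256)

step_secs <- 0
total_secs <- system.time({
  for (i in 1:200) {
    opt$zero_grad()
    loss <- nnf_mse_loss(model(x), y)
    loss$backward()
    step_secs <- step_secs + system.time(opt$step())[["elapsed"]]
  }
})[["elapsed"]]

cat(sprintf("total: %.2fs, optimizer$step(): %.2fs\n", total_secs, step_secs))
```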

egillax avatar Jun 09 '22 10:06 egillax

Yes, in theory that's possible. I'll draft something in that direction and post it here.

Ideally I'd really like to figure out what's slowing down optimizers in R compared to PyTorch, because being able to inherit from other optimizers and so on is really useful for research and verification. Still, I don't see what could be causing this; maybe the way we keep state? Anyway, I think having those C++-based optimizers is good.
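
To make the idea concrete, a wrapper along these lines might look roughly as follows. Note that cpp_adam_new() and cpp_adam_step() are hypothetical placeholder names for the LibTorch-level bindings, not functions that exist in torch today:

```r
library(torch)

# Sketch of the "instantiate the C++ optimizer, delegate step()" pattern.
# cpp_adam_new() / cpp_adam_step() are hypothetical bindings used purely for
# illustration; they are not part of the current torch API.
optim_adam_cpp <- optimizer(
  "optim_adam_cpp",
  initialize = function(params, lr = 1e-3) {
    defaults <- list(lr = lr)
    super$initialize(params, defaults)
    # hand the parameter tensors to LibTorch's Adam once, up front
    self$ptr <- cpp_adam_new(self$param_groups[[1]]$params, lr)
  },
  step = function(closure = NULL) {
    # the actual update happens entirely in C++
    cpp_adam_step(self$ptr)
    invisible(NULL)
  }
)
```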

dfalbel avatar Jun 09 '22 14:06 dfalbel

@egillax Here's a POC binding directly to the C++ LibTorch optimizers:

https://github.com/dfalbel/torchoptx

I haven't benchmarked it at all, but I'm curious to see whether it's much faster than the R-based optimizers. Currently only SGD and Adam are supported, but it shouldn't be a lot of work to add the others. We would also need to figure out how to support serialization.
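
Assuming torchoptx keeps the same constructor interface as torch::optim_adam() (which is what a drop-in replacement would imply; the actual exported names may differ, so check the repo), swapping it in would look like:

```r
library(torch)

model <- nn_linear(10, 1)

# R implementation shipped with torch
opt <- optim_adam(model$parameters, lr = 1e-3)

# hypothetical drop-in from the POC package, assuming it mirrors the same
# interface; see the torchoptx README for the actual constructor names
# opt <- torchoptx::optim_adam(model$parameters, lr = 1e-3)

# the training loop is identical either way
x <- torch_randn(32, 10)
y <- torch_randn(32, 1)
opt$zero_grad()
loss <- nnf_mse_loss(model(x), y)
loss$backward()
opt$step()
```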

dfalbel avatar Jun 09 '22 18:06 dfalbel

Hi @dfalbel,

That was quick! But I can't open the repo; is it by any chance private?

egillax avatar Jun 09 '22 19:06 egillax

Ohh sorry! Just made it public.

dfalbel avatar Jun 09 '22 19:06 dfalbel

@dfalbel Some preliminary results for a small ResNet run for 20 epochs on the same random data:

With the torch R Adam optimizer:

Average time per epoch was: 0.854 secs

With the torchoptx C++ Adam:

Average time per epoch was: 0.513 secs

And with PyTorch:

Average time per epoch: 0.403 seconds

So it's quite a bit faster than before and closer to PyTorch! And it's a drop-in replacement.

I'll test it more tomorrow; I have a Transformer model that showed larger differences between PyTorch and torch in R. I'll also post the code I use for testing if anyone is curious.
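
In the meantime, here is a minimal sketch of the kind of timing loop behind per-epoch averages like the ones above (illustrative; not the actual ResNet benchmark):

```r
library(torch)

# Minimal per-epoch timing sketch on random data.
model <- nn_linear(128, 10)
opt   <- optim_adam(model$parameters, lr = 1e-3)
x <- torch_randn(2048, 128)
y <- torch_randn(2048, 10)

n_epochs   <- 20
batch_size <- 128
n_batches  <- 2048 %/% batch_size
epoch_secs <- numeric(n_epochs)

for (epoch in seq_len(n_epochs)) {
  t0 <- Sys.time()
  for (b in seq_len(n_batches)) {
    idx <- ((b - 1) * batch_size + 1):(b * batch_size)
    opt$zero_grad()
    loss <- nnf_mse_loss(model(x[idx, ]), y[idx, ])
    loss$backward()
    opt$step()
  }
  epoch_secs[epoch] <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
}

cat(sprintf("Average time per epoch was: %.3f secs\n", mean(epoch_secs)))
```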

egillax avatar Jun 09 '22 20:06 egillax

@egillax Nice, this sounds promising! I think we should be able to achieve something very similar from R, especially on GPU, where the operations are non-blocking.

This will probably make more of a difference the more parameters the network has, though.

dfalbel avatar Jun 09 '22 22:06 dfalbel