Optimisers.jl

Split out the `rules.jl` as a sub-package (or a separate package) ?

Open · chengchingwen opened this issue 2 years ago · 9 comments

There are many different designs for how the update function/API could be implemented, but the application rules of each optimizer instance are the same. I wonder if we could split those rules out as another package (e.g. OptimiserRules.jl?), so that we can easily test other update implementations, or enable the possibility of having multiple optimizer packages, each with a different update API.
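For concreteness, by "rules" I mean just the per-array part. A minimal sketch, using the documented `init`/`apply!` interface (the rule name `MyDescent` is made up for illustration):

```julia
using Optimisers

# A rule is just a struct plus two methods; it knows nothing about how the
# model is walked or where the per-parameter state lives.
struct MyDescent{T} <: Optimisers.AbstractRule
    eta::T
end

# Plain gradient descent needs no per-parameter state.
Optimisers.init(::MyDescent, x::AbstractArray) = nothing

# Given the state, the parameter array and its gradient, return the new state
# and the step that the surrounding update machinery subtracts from x.
Optimisers.apply!(o::MyDescent, state, x::AbstractArray, dx) = state, o.eta .* dx
```

A hypothetical OptimiserRules.jl would contain only definitions like this; the `setup`/`update` tree-walking machinery would live elsewhere.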

chengchingwen · Aug 29 '22 08:08

This package still likes to think of itself as the new little thing moved outside Flux... it could be splintered further but it's a bit of a pain to manage too many small packages.

I'm curious, though: what designs for an update API do you have in mind? It's possible that what's here could still be simplified, and certain that it could allow more features.

mcabbott · Aug 29 '22 13:08

I don't really have any concrete ideas right now, but it seems worth splitting out the rule part. For example, if someone prefers the older IdDict-based update implementation, that could easily be done without including the fmap-based implementation.

chengchingwen · Aug 29 '22 15:08

Ok. I guess my vote is that such a thing should just depend on this package, until it's clear that the burden of loading more code than it needs outweighs the burden of maintaining two packages here. This package isn't very long, and doesn't depend on many things.

FWIW, there is an IdDict implementation here: https://github.com/FluxML/Flux.jl/pull/2029/files#diff-0ebc416cc227b7a58c66c0f81e8119576f68219437353510efc2822dcbf849dfR56-R87 . It's a lot simpler than all this tree-walking stuff, I have to say...
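For anyone comparing, that style is roughly the following shape (a rough sketch of the general idea only, not the code in that PR; `iddict_update!` is an invented name), and it can reuse the same `apply!` rules:

```julia
using Optimisers: AbstractRule, init, apply!

# Sketch of an IdDict-based updater: per-parameter state is keyed by array
# identity, instead of living in a tree of Leaf objects mirroring the model.
function iddict_update!(rule::AbstractRule, state::IdDict, grads::IdDict)
    for (p, dp) in grads                  # parameter array => gradient array
        dp === nothing && continue        # skip parameters with no gradient
        st = haskey(state, p) ? state[p] : init(rule, p)
        state[p], dp′ = apply!(rule, st, p, dp)
        p .-= dp′                         # mutate the parameter in place
    end
    return state
end
```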

mcabbott · Aug 29 '22 16:08

> FWIW, there is an IdDict implementation ... It's a lot simpler than all this tree-walking stuff, I have to say...

I would say that's the reason why I think it's worth splitting out. The current design supports many features: in-place and out-of-place accumulation, higher-order derivatives, etc. It's hard to design a good API that supports all of them. But by restricting the set of supported features, it should be possible to find a better design for a specific use.

> ... until it's clear that the burden of loading more code than it needs outweighs the burden of maintaining two packages here

Just out of curiosity, what increased maintenance effort would it need if we just made it a sub-package here? If I understand correctly, everything stays the same except that some files are put into another folder in this repo. And those rules don't seem to need much development/maintenance effort.

chengchingwen · Aug 29 '22 16:08

> what increased maintenance effort would it need

More juggling of which packages are dev'd locally. More different version numbers, more ways for conflicts to occur. More waiting for registration of updates, when you need one to finish before another can be tagged. More careful thought about what function is in which package, and what the version bounds have to be when you move something from one to the other.

> The current design supports many features: in-place and out-of-place accumulation, higher-order derivatives, etc.

Sure. Not sure how we got higher derivatives, IIRC they aren't used by any rules here & aren't even tested, no idea if the design is right. Old-style FluxML I guess...

But if you want to write a package exploring something more restricted, loading Optimisers.jl just as a source of rules does not imply any promise that you support all the same things.

I also think exploring better / alternative APIs isn't out-of-scope for this package. It's still quite young.

mcabbott · Aug 29 '22 19:08

Ok, that's fair.

> Not sure how we got higher derivatives, IIRC they aren't used by any rules here & aren't even tested, no idea if the design is right. Old-style FluxML I guess...

I'm not sure either. I thought that's what #16 is about.

chengchingwen · Aug 30 '22 02:08

Perhaps too snarky, sorry; the higher order case was before I paid this package any attention.

If that AdaHessian works then someone should at least add it to the tests. I guess it wants only the diagonal of the Hessian, which is good because that's all that can make sense with this package's tree-like view of the parameters.
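To make that concrete, an AdaHessian-style rule might look roughly like this. This is a sketch only, not tested: it is modelled on the existing Adam rule and assumes the Hessian-diagonal estimate `hdiag` arrives as a trailing derivative argument, which is my reading of the higher-order design; how that argument would actually be produced and threaded through `update!` is the open question.

```julia
using Optimisers

# Sketch: Adam-like rule whose second moment tracks a Hessian-diagonal
# estimate `hdiag` instead of the squared gradient.
struct AdaHessian{T} <: Optimisers.AbstractRule
    eta::T
    beta::Tuple{T,T}
    epsilon::T
end

Optimisers.init(o::AdaHessian, x::AbstractArray) = (zero(x), zero(x), o.beta)

function Optimisers.apply!(o::AdaHessian, state, x::AbstractArray, dx, hdiag)
    η, β, ϵ = o.eta, o.beta, o.epsilon
    mt, vt, βt = state
    @. mt = β[1] * mt + (1 - β[1]) * dx         # first moment: gradient
    @. vt = β[2] * vt + (1 - β[2]) * hdiag^2    # second moment: Hessian diagonal
    dx′ = @. η * (mt / (1 - βt[1])) / (sqrt(vt / (1 - βt[2])) + ϵ)
    return (mt, vt, βt .* β), dx′
end
```

Constructed e.g. as `AdaHessian(0.1, (0.9, 0.999), 1e-8)`. Whether `apply!` taking an extra derivative argument is really the intended higher-order hook is exactly the kind of thing a test would pin down.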

mcabbott · Aug 30 '22 02:08

A big reason why AdaHessian and other optimizers that use the HVP/Hessian diagonal (or an approximation thereof) haven't been ported is that they rely on nested differentiation. If we can get that working for non-trivial models, it should be much easier to test out those methods.
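For reference, the usual forward-over-reverse recipe looks something like the sketch below (one possible pairing, ForwardDiff over Zygote; it runs on toy functions, and making it work reliably for non-trivial models is exactly the hard part):

```julia
using ForwardDiff, Zygote

# Hessian-vector product: differentiate the reverse-mode gradient along
# direction v with forward mode.
hvp(f, x, v) = ForwardDiff.derivative(ε -> Zygote.gradient(f, x .+ ε .* v)[1],
                                      zero(eltype(x)))

# Hutchinson-style estimate of the Hessian diagonal, E[v .* (H*v)] over
# random Rademacher vectors v, which is what AdaHessian uses.
function hessian_diag_estimate(f, x; samples = 8)
    est = zero(x)
    for _ in 1:samples
        v = rand([-1.0, 1.0], size(x)...)
        est .+= v .* hvp(f, x, v)
    end
    return est ./ samples
end

# Toy check: for f(w) = sum(abs2, w) the Hessian is 2I, so this returns ≈ 2s.
hessian_diag_estimate(w -> sum(abs2, w), randn(5))
```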

ToucheSir · Aug 30 '22 04:08

Yeah, that PR really should have added the example as a test. But yes, the higher-order stuff was only put in place to potentially support such rules in the future, prior to releasing the package (at a time when Optimisers.jl had fewer types to dispatch on, which made deprecations harder). We have no higher-order rules, so we don't have it as a feature (yet).

darsnack · Aug 30 '22 20:08