Optimisers.jl
Split out the `rules.jl` as a sub-package (or a separate package)?
There are many different designs for how the update function/API could be implemented, but the application rules of each optimizer instance are the same. I wonder if we can separate those rules into another package (e.g. OptimiserRules.jl?), so that we can test other update implementations easily, or enable the possibility of having multiple optimizer packages with different update APIs defined.
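To make the rule part concrete, here is a rough sketch of what a single rule looks like under the current `init`/`apply!` interface. The `MyDescent` name is hypothetical; the real `Descent` in `rules.jl` differs in details:

```julia
using Optimisers

# Hypothetical sketch of a descent rule in the style of the current
# rule interface; not the package's actual Descent definition.
struct MyDescent{T} <: Optimisers.AbstractRule
    eta::T
end

# Plain gradient descent keeps no per-parameter state.
Optimisers.init(o::MyDescent, x::AbstractArray) = nothing

# Return the (unchanged) state and the step to subtract from x.
Optimisers.apply!(o::MyDescent, state, x, dx) = state, o.eta .* dx
```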
This package still likes to think of itself as the new little thing moved outside Flux... it could be splintered further, but it's a bit of a pain to manage too many small packages.
Am curious though what designs for an update API you have in mind? It's possible that what's here could still be simplified, and certain that it could allow more features.
I don't really have any concrete ideas right now, but it seems worth splitting out the rule part. For example, if someone prefers the older IdDict-based update implementation, that could easily be done without including the fmap-based implementation.
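A rough sketch of what I mean: an IdDict-based update could reuse the same rules while keying its state by parameter identity instead of mirroring the model tree. The function name and signature here are illustrative, not an existing API:

```julia
using Optimisers

# Hypothetical sketch of an IdDict-keyed update loop reusing the
# existing init/apply! rule interface. `params` is any collection of
# parameter arrays; `grads` maps each parameter to its gradient.
function update_iddict!(state::IdDict, rule, params, grads)
    for p in params
        s = get!(() -> Optimisers.init(rule, p), state, p)
        s, dp = Optimisers.apply!(rule, s, p, grads[p])
        state[p] = s
        p .-= dp          # apply the step in place
    end
    return state
end
```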
Ok. I guess my vote is that such a thing should just depend on this package, until it's clear that the burden of loading more code than it needs outweighs the burden of maintaining two packages here. This package isn't very long, and doesn't depend on many things.
FWIW, there is an IdDict implementation here: https://github.com/FluxML/Flux.jl/pull/2029/files#diff-0ebc416cc227b7a58c66c0f81e8119576f68219437353510efc2822dcbf849dfR56-R87 . It's a lot simpler than all this tree-walking stuff, I have to say...
FWIW, there is an IdDict implementation ... It's a lot simpler than all this tree-walking stuff, I have to say...
I would say that's the reason why I think it's worth splitting out. The current design supports many features: in-place and out-of-place accumulation, higher-order derivatives, etc. It's hard to design a good API that supports all of them, but by restricting the number of supported features, it's possible to find a better design for a specific use.
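For reference, by in-place and out-of-place accumulation I mean roughly the package's two entry points (a sketch; `model` and `grad` stand for some model and its gradient):

```julia
using Optimisers

# Assumes `model` and its gradient `grad` are already defined.
state = Optimisers.setup(Optimisers.Descent(0.1), model)

# Out-of-place: returns a new state tree and a new model.
state, model = Optimisers.update(state, model, grad)

# In-place where possible: mutates arrays, still returns both.
state, model = Optimisers.update!(state, model, grad)
```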
... until it's clear that the burden of loading more code than it needs outweighs the burden of maintaining two packages here
Just out of curiosity, what increased maintenance effort would it need if we just made it a sub-package here? If I understand correctly, everything stays the same except that some files are put into another folder in this repo. And those rules don't seem to need much development/maintenance effort.
what increased maintenance effort would it need
More juggling of which packages are dev'd locally. More separate version numbers, more ways for conflicts to occur. More waiting for registration of updates, when you need one to finish before another can be tagged. More careful thought about which function lives in which package, and what the version bounds have to be when you move something from one to the other.
The current design supports many features: in-place and out-of-place accumulation, higher-order derivatives, etc.
Sure. I'm not sure how we got higher derivatives; IIRC they aren't used by any rules here & aren't even tested, and I have no idea if the design is right. Old-style FluxML, I guess...
But if you want to write a package exploring something more restricted, loading Optimisers.jl just as a source of rules does not imply any promise that you support all the same things.
I also think exploring better / alternative APIs isn't out-of-scope for this package. It's still quite young.
Ok, that's fair.
I'm not sure how we got higher derivatives; IIRC they aren't used by any rules here & aren't even tested, and I have no idea if the design is right. Old-style FluxML, I guess...
I'm not sure either. I thought that's what #16 is about.
Perhaps too snarky, sorry; the higher order case was before I paid this package any attention.
If that AdaHessian works, then someone should at least add it to the tests. I guess it wants only the diagonal of the Hessian, which is good, because that's all that can make sense with this package's tree-like view of the parameters.
A big reason why AdaHessian and other optimizers which use the HVP/Hessian diagonal (or an approximation thereof) haven't been ported is that they rely on nested differentiation. If we can get that working for non-trivial models, then it should be much easier to test out those methods.
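To make the nested-AD dependency concrete, here is a hedged sketch of Hutchinson's estimator for the Hessian diagonal. The `hvp(f, x, v)` it assumes (typically forward-over-reverse nested AD) is exactly the part that doesn't yet work reliably for non-trivial models:

```julia
using Statistics

# Hypothetical sketch: Hutchinson's estimator for diag(H), assuming a
# working Hessian-vector product `hvp(f, x, v)` is available.
function hessian_diag_estimate(hvp, f, x; samples=10)
    mean(1:samples) do _
        z = ifelse.(rand(Bool, size(x)), 1.0, -1.0)  # Rademacher probe
        z .* hvp(f, x, z)  # in expectation, z .* (H*z) equals diag(H)
    end
end
```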
Yeah, that PR really should have added the example as a test. But yes, the higher-order stuff was only put in place to potentially support such rules in the future, prior to releasing the package (at a time when Optimisers.jl had fewer types to dispatch on, making deprecations harder). We have no higher-order rules, so we don't have it as a feature (yet).