
[Postponed] In applications without mini-batches, work around costly parameter updates

Open · Mikolaj opened this issue 3 years ago · 0 comments

This is not urgent and, in fact, it's likely to complicate the code a lot without any benefit for neural network applications. Revisit when we have other convincing applications where it matters.

We can probably gain performance by eliminating the explicit step of adjusting the parameters by the gradients: instead, whenever we update the parameters in the Var cases of eval, we immediately multiply the increments by gamma and subtract. In other words, we don't construct the gradients at all, but gradually construct the new parameters, starting from the old ones. That may be what the ad library does in its gradWith combine (f input) parameters calls. However, for this we first need to implement Adam and other gradient-descent schemes, because our gdSmart gradient-descent operation already uses both the old and the new values of the gradients. It could probably use only the new values applied to the parameters, but other schemes may be less forgiving.
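To make the idea concrete, here is a minimal sketch over a toy delta-expression type; Delta, evalGradient, updateClassic and updateFused are illustrative stand-ins, not the actual horde-ad types or API:

```haskell
import qualified Data.Map.Strict as M

-- A toy delta expression over scalar parameters.
data Delta
  = Zero
  | Var Int             -- contribution flowing into the parameter with this id
  | Scale Double Delta
  | Add Delta Delta

type Params = M.Map Int Double

-- Classic scheme: first build an explicit gradient map ...
evalGradient :: Double -> Delta -> M.Map Int Double -> M.Map Int Double
evalGradient c d acc = case d of
  Zero       -> acc
  Var i      -> M.insertWith (+) i c acc
  Scale k d' -> evalGradient (k * c) d' acc
  Add d1 d2  -> evalGradient c d2 (evalGradient c d1 acc)

-- ... and only then adjust the parameters by gamma times the gradient.
updateClassic :: Double -> Delta -> Params -> Params
updateClassic gamma d params =
  let grad = evalGradient 1 d M.empty
  in M.unionWith (-) params (M.map (gamma *) grad)

-- Fused scheme: no gradient map at all; at each Var we immediately
-- multiply the increment by gamma and subtract it from the parameter.
updateFused :: Double -> Delta -> Params -> Params
updateFused gamma d params = go 1 d params
  where
    go c delta ps = case delta of
      Zero       -> ps
      Var i      -> M.adjust (\p -> p - gamma * c) i ps
      Scale k d' -> go (k * c) d' ps
      Add d1 d2  -> go c d2 (go c d1 ps)

main :: IO ()
main = do
  let params = M.fromList [(0, 1.0), (1, 2.0)]
      delta  = Add (Scale 3 (Var 0)) (Var 1)  -- gradient is (3, 1)
      gamma  = 0.1
  print (updateClassic gamma delta params)
  print (updateFused   gamma delta params)
  -- both print the same updated parameters; only the fused version
  -- avoids building the intermediate gradient map
```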

This approach involves one more multiplication whenever a parameter is adjusted, which would be almost free if not for the implementation detail that it also incurs one more allocation, the way it's currently done in hmatrix. Low-level FFI work would be needed to fix that.
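For illustration, a standalone hmatrix snippet (a hedged sketch, not horde-ad code; updateParam is a hypothetical helper) showing where the extra allocation comes from: the scaled increment is materialised as a temporary vector before the subtraction.

```haskell
import Numeric.LinearAlgebra

-- Hypothetical helper: adjust one parameter vector p by gamma times
-- its increment g.
updateParam :: Double -> Vector Double -> Vector Double -> Vector Double
updateParam gamma p g = p - scale gamma g
  -- 'scale gamma g' allocates a temporary vector and the subtraction
  -- allocates the result; fusing the two into an in-place update is
  -- what the FFI remark above refers to

main :: IO ()
main = print (updateParam 0.1 (vector [1, 2, 3]) (vector [3, 1, 0]))
```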

This all may be moot once we do mini-batches, as is customary in machine learning in order to expose parallelism for the GPU. With mini-batches, the parameters are updated once per hundreds of gradient computations. However, in niches where gradients computed at the scalar level matter, updating the parameters may remain the bottleneck.

Mikolaj · Feb 12 '22 00:02