horde-ad
[Postponed] In applications without mini-batches, work around costly parameter updates
This is not urgent and, in fact, it's likely to complicate the code a lot without any benefit for neural network applications. Revisit when we have other convincing applications where it matters.
We can probably gain performance if we eliminate the explicit step of adjusting parameters by gradients and, instead, immediately multiply the increments by `gamma` and subtract, whenever we update the parameters in the `Var` cases of `eval`. In other words, we don't construct gradients at all, but instead gradually construct the new parameters, starting from the old ones. That may be what the `ad` library is doing in its `gradWith combine (f input) parameters` calls.

However, for this we need to implement Adam and other gradient descent schemes first, because our `gdSmart` gradient descent operation already uses both old and new values of gradients. It could probably use only the new values applied to parameters, but other schemes may be less forgiving.

This approach involves one more multiplication whenever a parameter is adjusted, which would be almost free, if not for the implementation detail that it also incurs one more allocation, the way it's currently done in hmatrix. Low-level FFI work would be needed to fix that.
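To make the contrast concrete, here is a minimal sketch of the two update styles, assuming parameters are plain lists of `Double` and cotangent increments arrive as (index, value) pairs; the names and types are illustrative only, not the representation used in `eval` or in hmatrix:

```haskell
module FusedUpdateSketch where

-- Style 1: build an explicit gradient, then adjust parameters by it.
updateViaGradient :: Double -> [Double] -> [Double] -> [Double]
updateViaGradient gamma params gradient =
  zipWith (\p g -> p - gamma * g) params gradient

-- Style 2: never materialise the gradient; start from the old parameters
-- and subtract each scaled increment as soon as it arrives (modelled here
-- as an update at a single parameter index).
accumulateIncrement :: Double -> Int -> Double -> [Double] -> [Double]
accumulateIncrement gamma i delta params =
  [ if j == i then p - gamma * delta else p
  | (j, p) <- zip [0 ..] params ]

main :: IO ()
main = do
  let gamma      = 0.01
      params     = [1.0, 2.0, 3.0]
      increments = [(0, 0.5), (2, -1.5)]   -- (parameter index, cotangent)
      -- Style 1: first sum the increments into a gradient vector.
      gradient   = [ sum [ d | (i, d) <- increments, i == j ]
                   | j <- [0 .. length params - 1] ]
      viaGrad    = updateViaGradient gamma params gradient
      -- Style 2: fold the scaled updates directly into the parameters.
      fused      = foldr (\(i, d) ps -> accumulateIncrement gamma i d ps)
                         params increments
  print viaGrad
  print fused   -- both print [0.995,2.0,3.015]
```

The fused style never materialises `gradient`; in the real code the scaled subtraction would happen on the hmatrix vectors inside `eval`, which is where the extra multiplication and the extra allocation mentioned above would show up.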
This all may be moot once we do mini-batches, as is customary in machine learning in order to expose parallelism for GPU. With mini-batches, parameter updates happen only once per hundreds of gradient computations. However, in niches where gradients computed at the scalar level matter, updating parameters may remain the bottleneck.