axon
Add Eli & Rishi's correlation-based learning mechanism
http://arxiv.org/abs/2011.07334 -- the key equation is (Var_i + Var_j - 2 Covar_ij): maximize the variance of the sender and receiver while minimizing the covariance between the two.
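As a side note (my gloss, not a claim from the paper itself): by the standard variance-of-a-difference identity, this expression is just the variance of the difference between sender and receiver activity, so driving it up decorrelates the pair while keeping both units active:

$$
\mathrm{Var}_i + \mathrm{Var}_j - 2\,\mathrm{Covar}_{ij} \;=\; \mathrm{Var}(x_i - x_j)
$$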
In my experiments, I updated the SWt (structural, spine, slow) weight in the slower outer-loop cycle as a function of accumulated Var and Covar stats (computed using simple running-average act - mean values). This produces a graded, pruning-like function: because SWt multiplies the regular "fast" learned weights, reducing it toward 0 yields an effective "soft" form of pruning.
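A rough sketch of what this looks like -- names, the time constant, and the per-synapse layout here are hypothetical illustrations, not the actual axon API:

```go
// Hypothetical illustration, not the actual axon API: per-synapse
// running-average Var / Covar stats, plus the multiplicative role of SWt.
package correl

// SynCor holds running-average statistics for one synapse between
// sender i and receiver j.
type SynCor struct {
	VarI  float32 // running variance of sender activity (around its mean)
	VarJ  float32 // running variance of receiver activity
	Covar float32 // running covariance between sender and receiver
}

// Tau is the running-average time constant in trials (illustrative value).
const Tau = 100.0

// Accumulate updates the running stats from one trial, given the
// mean-centered activities (act - running mean) of sender and receiver.
func (sc *SynCor) Accumulate(di, dj float32) {
	dt := float32(1.0 / Tau)
	sc.VarI += dt * (di*di - sc.VarI)
	sc.VarJ += dt * (dj*dj - sc.VarJ)
	sc.Covar += dt * (di*dj - sc.Covar)
}

// EffWt shows why reducing SWt toward 0 is a "soft" form of pruning:
// the structural weight multiplies the fast learned weight.
func EffWt(swt, fastWt float32) float32 {
	return swt * fastWt
}
```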
Having worked through the logic here more carefully, I realized I had an error in the initial implementation: it missed the factor of 2 on Covar_ij. I also realized that the pruning logic makes more sense if it only includes the negative component of this value -- otherwise there is a Hebbian-like, variance-increasing force constantly working to increase the weights, which is not present in the pruning-only version.
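A corresponding sketch of the corrected slow update, with the factor of 2 and an option to keep only the negative (pruning) component; the [0, 1] clamp is an assumption, not the actual axon SWt bounds:

```go
// Hypothetical illustration of the corrected slow outer-loop SWt update.
package correl

// SWtDelta computes the slow SWt change from the accumulated running stats,
// now including the factor of 2 on Covar. With pruneOnly, only the negative
// component is used, giving a pure pruning force and omitting the
// Hebbian-like, variance-increasing force from the positive component.
func SWtDelta(varI, varJ, covar, lrate float32, pruneOnly bool) float32 {
	d := varI + varJ - 2*covar // key quantity: Var_i + Var_j - 2 Covar_ij
	if pruneOnly && d > 0 {
		d = 0
	}
	return lrate * d
}

// ApplySWt adds the delta to SWt and clamps it to [0, 1] (assumed range),
// so a value near 0 effectively (softly) prunes the synapse.
func ApplySWt(swt, delta float32) float32 {
	swt += delta
	if swt < 0 {
		swt = 0
	} else if swt > 1 {
		swt = 1
	}
	return swt
}
```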
Using both the positive and negative components at a 0.1 learning rate appears to work well in the large-scale LVis object recognition model -- it significantly reduces the strength of the top-5 PCA components while driving a solid number of strong PCA components throughout learning. The output layer dynamics still need fixing, but decoding shows continued learning throughout!