TF_ContinualLearningViaSynapticIntelligence
Loss normalization
May I ask why you aren't normalizing the cross_entropy loss across the batch before computing the gradients in the following line:
cross_entropy = -tf.reduce_sum( y_tgt*tf.log(y+1e-04) + (1.-y_tgt)*tf.log(1.-y+1e-04) )
If I try to change it to a normalized version
cross_entropy = tf.reduce_mean( -( y_tgt*tf.log(y+1e-04) + (1.-y_tgt)*tf.log(1.-y+1e-04) ) )
I can see that the small_omega_vars updates become very small (due to the smaller gradients), and consequently the resulting big_omega_var is also very small. This causes the model to drift a lot on the earlier tasks. I wonder if the authors mentioned anything about summing the gradients across the batch rather than normalizing them?
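For context, here is a minimal sketch (plain Python/NumPy, hypothetical names, not the repo's code) of why the scaling matters under SGD: with delta_theta = -eta*g, each per-step importance contribution -g*delta_theta equals eta*g**2, so dividing the loss by the batch size B scales the gradient by 1/B and the accumulated small omega by roughly 1/B**2.

import numpy as np

eta, B, steps = 0.1, 256, 100
rng = np.random.default_rng(0)

small_omega_sum = 0.0   # accumulated with the loss summed over the batch
small_omega_mean = 0.0  # accumulated with the loss averaged over the batch

for _ in range(steps):
    g = rng.normal()                       # gradient of the summed loss
    g_mean = g / B                         # gradient of the averaged loss
    small_omega_sum += eta * g ** 2        # -g * delta_theta with delta_theta = -eta*g
    small_omega_mean += eta * g_mean ** 2

print(small_omega_sum / small_omega_mean)  # ~ B**2, i.e. 65536 for B = 256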
You are indeed correct, thanks for pointing it out!
The correct version should be (specifying the reduction axes)
cross_entropy = - tf.reduce_mean(tf.reduce_sum( y_tgt*tf.log(y+1e-04) + (1.-y_tgt)*tf.log(1.-y+1e-04) ,1) ,0)
However, this results in lower accuracy, ~89% on the first task instead of ~98%. I must have made some error, because it doesn't make much sense. Maybe it's due to using SGD instead of fancier optimizers?
Note that in the paper they use Adam instead of SGD. I have slightly more complex code locally that actually computes the delta(t) weight changes, instead of simply replacing them with -eta*gradient (which only holds for SGD). I will update the repository soon. :)
So computing the delta(t) as -eta*gradient would only work for the SGD optimizer; for fancier optimizers, delta(t) would be something else. Why don't you update the weights for the step and then calculate the delta(t)? That way you wouldn't have to worry about which optimizer you are using. Hope this makes sense.
Yep. I just wanted to do it automatically. The code I'm using on my computer computes the delta by adding dependencies and grouping a few different operations under the train op. Simply put, it updates the weights under a dependency that the weights have first been saved in a temporary variable, and then computes the delta under the dependency that the weights have been updated. I will show you the code in a few days -- I'm moving out of the country tomorrow and I don't have access to a good computer. ;)
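In the meantime, here is a minimal sketch of that pattern in TF1-style graph code (variable names like prev_vars and small_omega_vars are placeholders of mine, not necessarily what the repository uses): snapshot the weights, let the optimizer apply its update, and only then accumulate small omega from the actual weight change, so it works with Adam or any other optimizer.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# toy model so the sketch is self-contained
x = tf.placeholder(tf.float32, [None, 4])
y_tgt = tf.placeholder(tf.float32, [None, 2])
w = tf.Variable(tf.zeros([4, 2]))
y = tf.sigmoid(tf.matmul(x, w))
loss = -tf.reduce_mean(tf.reduce_sum(
    y_tgt*tf.log(y+1e-04) + (1.-y_tgt)*tf.log(1.-y+1e-04), 1), 0)

opt = tf.train.AdamOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(loss)

# per-variable buffers (placeholder names)
prev_vars = [tf.Variable(tf.zeros(v.get_shape()), trainable=False)
             for _, v in grads_and_vars]
small_omega_vars = [tf.Variable(tf.zeros(v.get_shape()), trainable=False)
                    for _, v in grads_and_vars]

# 1) save the current weights in temporary variables ...
save_prev = [p.assign(v) for p, (_, v) in zip(prev_vars, grads_and_vars)]

with tf.control_dependencies(save_prev):
    # 2) ... then let the optimizer update the weights ...
    apply_grads = opt.apply_gradients(grads_and_vars)

with tf.control_dependencies([apply_grads]):
    # 3) ... and accumulate small_omega += -g * (theta_new - theta_old),
    #    i.e. the actual per-step weight change, whatever the optimizer.
    accumulate = [o.assign_add(-g * (v - p)) for o, (g, v), p
                  in zip(small_omega_vars, grads_and_vars, prev_vars)]

train_op = tf.group(*accumulate)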