TF_ContinualLearningViaSynapticIntelligence
Loss normalization
May I ask why you aren't normalizing the cross_entropy loss across the batch before computing the gradients in the following line:
cross_entropy = -tf.reduce_sum( y_tgt*tf.log(y+1e-04) + (1.-y_tgt)*tf.log(1.-y+1e-04) )
If I try to change it to a normalized version
cross_entropy = tf.reduce_mean( -( y_tgt*tf.log(y+1e-04) + (1.-y_tgt)*tf.log(1.-y+1e-04) ) )
I can see that the small_omega_vars updates become very small (due to the smaller gradients), and consequently the resulting big_omega_var is also very small. This causes the model to drift a lot on the earlier tasks. I wonder if the authors mentioned anything about summing the gradients across the batch rather than normalizing them?
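For context, here is a minimal sketch (plain Python/NumPy, hypothetical names, not the repo's code) of why the scaling matters under SGD: with delta_theta = -eta*g, each per-step importance contribution -g*delta_theta equals eta*g**2, so dividing the loss by the batch size B scales the gradient by 1/B and the accumulated small omega by roughly 1/B**2.

import numpy as np

eta, B, steps = 0.1, 256, 100
rng = np.random.default_rng(0)

small_omega_sum = 0.0   # accumulated with the loss summed over the batch
small_omega_mean = 0.0  # accumulated with the loss averaged over the batch

for _ in range(steps):
    g = rng.normal()                       # gradient of the summed loss
    g_mean = g / B                         # gradient of the averaged loss
    small_omega_sum += eta * g ** 2        # -g * delta_theta with delta_theta = -eta*g
    small_omega_mean += eta * g_mean ** 2

print(small_omega_sum / small_omega_mean)  # ~ B**2, i.e. 65536 for B = 256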
You are indeed correct, thanks for pointing it out!
The correct version should be (specifying the reduction axes)
cross_entropy = - tf.reduce_mean(tf.reduce_sum( y_tgt*tf.log(y+1e-04) + (1.-y_tgt)*tf.log(1.-y+1e-04) ,1) ,0)
However, this results in lower accuracy, ~89% on the first task instead of ~98%. I must have made some error, because it doesn't make much sense. Maybe it's due to using SGD instead of fancier optimizers?
Note that in the paper they use Adam instead of SGD. I have slightly more complex code locally that actually computes the delta(t) weight changes, instead of simply replacing them with -eta*gradient (which only holds for SGD). I will update the repository soon. :)
So computing the delta(t) as -eta*gradient would only work for the SGD optimizer; for fancier optimizers, delta(t) would be something else. Why don't you update the weights for the step and then calculate the delta(t)? That way you wouldn't have to worry about which optimizer you are using. Hope this makes sense.
Yep. I just wanted to do it automatically. The code I'm using on my computer computes the delta by adding dependencies and grouping a few different operations under the train op. Simply put, it updates the weights under a dependency that the weights have first been saved in a temporary variable, and then computes the delta under the dependency that the weights have been updated. I will show you the code in a few days -- I'm moving out of the country tomorrow and I don't have access to a good computer. ;)
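In the meantime, here is a minimal sketch of that pattern in TF1-style graph code (variable names like prev_vars and small_omega_vars are placeholders of mine, not necessarily what the repository uses): snapshot the weights, let the optimizer apply its update, and only then accumulate small omega from the actual weight change, so it works with Adam or any other optimizer.

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# toy model so the sketch is self-contained
x = tf.placeholder(tf.float32, [None, 4])
y_tgt = tf.placeholder(tf.float32, [None, 2])
w = tf.Variable(tf.zeros([4, 2]))
y = tf.sigmoid(tf.matmul(x, w))
loss = -tf.reduce_mean(tf.reduce_sum(
    y_tgt*tf.log(y+1e-04) + (1.-y_tgt)*tf.log(1.-y+1e-04), 1), 0)

opt = tf.train.AdamOptimizer(1e-3)
grads_and_vars = opt.compute_gradients(loss)

# per-variable buffers (placeholder names)
prev_vars = [tf.Variable(tf.zeros(v.get_shape()), trainable=False)
             for _, v in grads_and_vars]
small_omega_vars = [tf.Variable(tf.zeros(v.get_shape()), trainable=False)
                    for _, v in grads_and_vars]

# 1) save the current weights in temporary variables ...
save_prev = [p.assign(v) for p, (_, v) in zip(prev_vars, grads_and_vars)]

with tf.control_dependencies(save_prev):
    # 2) ... then let the optimizer update the weights ...
    apply_grads = opt.apply_gradients(grads_and_vars)

with tf.control_dependencies([apply_grads]):
    # 3) ... and accumulate small_omega += -g * (theta_new - theta_old),
    #    i.e. the actual per-step weight change, whatever the optimizer.
    accumulate = [o.assign_add(-g * (v - p)) for o, (g, v), p
                  in zip(small_omega_vars, grads_and_vars, prev_vars)]

train_op = tf.group(*accumulate)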