RAM
Question about the gradient and pretraining
First, in the function calc_reward, when you compute J you use p_loc, which is built from mean_locs and sample_locs, but both mean_locs and sample_locs are wrapped in stop_gradient. So I think tf.log(p_loc + SMALL_NUM) * (R - no_grad_b) has no effect when computing the gradients. Also, why does this need pretraining? I never found that method in the paper.
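Here is a minimal TF1-style sketch of what I mean (stand-in names and shapes, not your actual code): if both locations are gradient-stopped, the REINFORCE term contributes nothing to the LocNet weights.

```python
import tensorflow as tf

h = tf.ones([1, 4])                           # stand-in hidden state
w = tf.get_variable('w_loc', [4, 2])          # stand-in LocNet weights
mean_loc = tf.matmul(h, w)                    # LocNet output
sample_loc = mean_loc + 0.1 * tf.random_normal([1, 2])

mean_loc = tf.stop_gradient(mean_loc)         # blocks d(log p_loc)/dw ...
sample_loc = tf.stop_gradient(sample_loc)     # ... and so does this

# unnormalized Gaussian log-density, i.e. log p(sample_loc | mean_loc)
log_p_loc = -tf.reduce_sum((sample_loc - mean_loc) ** 2) / (2 * 0.1 ** 2)
loss = -log_p_loc * 1.0                       # 1.0 stands in for (R - b)

print(tf.gradients(loss, w))                  # -> [None]: no gradient at all
```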
Thanks for releasing your code. Can you resolve my doubts? And have you finished the experiment on the 100x100 translated cluttered MNIST data? If you have, please @ me. Thanks.
@hhhmoan Hi! Thank you very much for spending time reading the code and pointing out mysterious aspects!
The code actually works WITHOUT the pretraining. I was just a little bit curious to see if pretraining could improve anything. By the way, I just realized that the implementation of the pretraining part is incomplete (it is not training the location network)...
On the other hand, I totally understand your suggestion about the stop_gradient on the location output. I will do some testing on that! We are still actively working on debugging. I will report our attempts to replicate the original results by Mnih et al. ASAP!
Thank you very much for your suggestion again!
^_^ To tell the truth, I failed to implement this paper on the translated cluttered MNIST data in TensorFlow. If you succeed and release your results, it would really help me.
Best wishes for your work!
I'm glad to hear that! We are working on it!
Some ops in RAM are not differentiable, and the paper uses the reward to replace the gradient so that we can do backpropagation. But in your code, the reward is inside the loss, so it acts like a normal loss term and TF will compute gradients through the ops that are not differentiable. I see that you use tf.stop_gradient in some places. Does that mean you just stop the gradient at these ops? If so, I don't understand why the model works, because the parameters in the LocNet would be frozen.
@Lzc6996 You are right! Our stop_gradient usage is incorrect and we still haven't figured out how to resolve this issue.
We failed to notice it because the network still "works". It might be the case that the other layers can compensate for the frozen, terrible parameters in the LocNet by tuning their own parameters.
Qihong, have you worked out how to update the parameters in the LocNet yet?
Sorry, not yet. Our code is incorrect for some more fundamental reasons. I am still not exactly sure how to fix it. You can take a look at this repo: https://github.com/zhongwen/RAM
This RAM implementation beats ours on the 28x28 standard MNIST.
Thanks for your kind and prompt reply. Do you think the implementation you mentioned above realizes the parameter updates as Mnih's paper describes?
@JasonZhao001 I am not sure. I plan to replicate Mnih's results with that implementation. I would be more certain if the replication were successful.
@JasonZhao001 And I also plan to visualize that implementation with TensorBoard.
Yeah, I found that the work done by Zhongwen omits checkpointing and summaries (TensorBoard); it would be helpful if they were added.
And there is a question that still confuses me when I try to work with your code:
The "DRAW WINDOW" functions don't work on my machine, even when I set the control parameters to True. I put a "print" in the animate block and it prints during training as expected, but still no window shows. I wonder whether this problem is specific to my setup, so can you tell me if it works on your platform now?
@JasonZhao001 That's strange. "draw" should work when you set draw to 1. Can you send me the error message? Thanks!
There is no error message; it just doesn't show the window as it should. So I suspect it may be a problem with my platform.
@JasonZhao001 I see. Let me know if you get more clues about what is going on. I am more than happy to help!
@QihongL Found the reason. matplotlib had something wrong on my platform; when I ran $ sudo pip uninstall matplotlib, it worked! It may be because I had installed two versions of matplotlib, and when I installed the second one I chose to ignore the existing one. Thanks a lot!
@JasonZhao001 Great!
@QihongL I have found the error in your gradient implementation. The gradient should flow only via mean_loc, not from samp_loc, because samp_loc gives you the location in the input image from which you should sample the next glimpse, and hence is non-differentiable. But when you define the loss function:
J = tf.concat(1, [tf.log(p_y + SMALL_NUM) * (onehot_labels_placeholder), tf.log(p_loc + SMALL_NUM) * (R - no_grad_b)])
you back-propagate the gradient of the loss through the computation graph that mean_loc is part of, hence you calculate the gradient w.r.t. mean_loc. You do not calculate the gradient of the loss w.r.t. samp_loc.
EDIT So, if you comment out the line:
mean_loc = tf.stop_gradient(mean_loc)
and keep the line:
sample_loc = tf.stop_gradient(sample_loc)
things should work. Let me know if it works for you.
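For concreteness, here is the same toy graph as the sketch earlier in this thread, with only sample_loc gradient-stopped (again stand-in names, not the repo's actual code); the REINFORCE term now reaches the LocNet weights:

```python
import tensorflow as tf

h = tf.ones([1, 4])
w = tf.get_variable('w_loc_fixed', [4, 2])
mean_loc = tf.matmul(h, w)                        # NOT gradient-stopped
sample_loc = tf.stop_gradient(
    mean_loc + 0.1 * tf.random_normal([1, 2]))    # sampling stays blocked

log_p_loc = -tf.reduce_sum((sample_loc - mean_loc) ** 2) / (2 * 0.1 ** 2)
loss = -log_p_loc * 1.0                           # 1.0 stands in for (R - b)

print(tf.gradients(loss, w))                      # -> a real gradient tensor
```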
@GodOfProbability Hi! Thank you so much for pointing this out!
I also think this is causing the trouble. I think I tried commenting that line out before and it didn't work. I guess if I don't stop_grad it, then the gradients flow over time (across different glimpses).
@QihongL I did an experiment on some toy example, and stopping only the gradient from the sample_loc improves the performance and not the other way around.
> I guess if I don't stop_grad it, then the gradients flow over time (across different glimpses).
I think it will not propagate the gradient over time, because only sample_loc interacts with the next time step, not mean_loc. Hence if you stop the gradient at sample_loc, that is sufficient to stop the "bad" gradient from flowing across time. Furthermore, keep in mind that mean_loc is a different tensor predicted at every time step, and the back-propagation starts by finding the gradient w.r.t. mean_loc, so nothing comes from the non-differentiable part.
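A tiny two-step sketch of this point (stand-in names and shapes again): once the sample is gradient-stopped, nothing downstream of step t can reach the LocNet weights of step t.

```python
import tensorflow as tf

img_feat = tf.ones([1, 4])                        # stand-in image features
w = tf.get_variable('w_loc_t', [4, 2])            # LocNet weights at step t

# step t: predict a location and sample around it
mean_loc_t = tf.matmul(img_feat, w)
sample_loc_t = tf.stop_gradient(mean_loc_t + 0.1 * tf.random_normal([1, 2]))

# step t+1: only sample_loc_t feeds forward (as the next glimpse location)
next_input = tf.concat([img_feat, sample_loc_t], 1)   # stand-in glimpse
h_next = tf.layers.dense(next_input, 4)
loss_t1 = tf.reduce_sum(h_next ** 2)                  # stand-in loss at t+1

print(tf.gradients(loss_t1, w))    # -> [None]: nothing flows back across time
```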
@GodOfProbability That's very interesting... I will try that! Thank you very much for the suggestion! I will let you know what I find out!
@GodOfProbability @QihongL Suppose we make an assumption: we don't sample at all but use mean_loc directly, similar to the soft attention in "Show, Attend and Tell". Then the questions are:
- Do you think it would work well?
- Of course, the gradient of mean_loc would flow across time in this case (see the sketch below). Would this kind of gradient be "bad", as you said?
"bad" gradient from flowing across time.then the gradients flow over time (across different glimpses).
@JasonZhao001 If you stop the gradient at sample_loc, the bad gradients will not flow, because only sample_loc interacts across time; stopping the gradient from sample_loc actually stops the across-time gradient from flowing through mean_loc. However, there are gradients coming from the loss term that corresponds to the reward function, which should flow through mean_loc (this gradient comes from differentiating the Monte Carlo approximation of the gradient of the reward function, with sample_loc treated as a constant). If time permits, you should do the experiments and let us know.
@GodOfProbability Yes, you are right! The parameters of the location-generation module rely on the derivative of log[P(sample_loc|mean_loc,sigma)] w.r.t. parameters_loc to update, which chains through the derivative of mean_loc w.r.t. parameters_loc. I will do experiments on it later, and I will try the assumption as well; I will report my results then.
@QihongL @GodOfProbability @jlindsey15 @Lzc6996 It proves to work well when I comment out the line mean_loc = tf.stop_gradient(mean_loc), as Gopal described above. With bandwidth = 12, it converges to more than 96% accuracy at 600k time steps (early stop). And I'm sure it can reach Mnih's result with some tuning of the training parameters. By the way, if you stop_gradient at mean_loc, TensorBoard shows that the parameters there never update during training.

I also have a possible explanation for why the broken version still works: the attention window with three scales, especially the 2nd and 3rd ones, can cover enough information (48 w.r.t. 60), and the bandwidth is large enough to recognize the slightly lower resolution of the 2nd and 3rd scales (1/2 and 1/4 times, respectively). So the result relies on the fully connected layers for classification, and it is effectively the same as recognition from a fixed-location glimpse. You can try using only two scales (e.g., 12 and 24) or making the bandwidth smaller (e.g., 8 or 6); then it will not work so well. The same holds for the work in https://github.com/zhongwen/RAM. Moreover, once you fix the problem, a smaller bandwidth (e.g., 8) performs better, with higher accuracy and faster convergence. In my experiment it converges to 97% at 400k time steps! (So I made an early stop.)

Furthermore, I found that this original implementation does not apply the M-times sampling. I plan to try it, and any suggestions from you would help a lot! If I succeed, I will share the source code as well. Thanks!
@JasonZhao001 @GodOfProbability Hi Jason, I read your comment with great interest. I also solved the stop_gradient problem, but I cannot achieve performance as high as yours (in my case, about 94% on the translated case). Can I know your hyperparameters and learning strategies? Moreover, apart from the mean_loc issue, do you think the baseline in this code is correctly implemented? I think the baseline should also be learnable. Please give me your opinion! Thanks!
@jtkim-kaist The baseline technique is very important for location prediction. It is learnable via an extra term in the cost function, as shown in the source code below:
J = J - tf.reduce_sum(tf.square(R - b), 1)
Note that the parameters of the baseline part are learned separately from the other two parts. Of course, I have modified some of the hyperparameters to make it work better, but I know they are not the best; I'm still trying. If I succeed, I will post my implementation later.
@JasonZhao001 Thank you for your kind comment.
I also agree with you. However, in this code the baseline is implemented like below:
baseline = tf.sigmoid(tf.matmul(hiddensState, Wb_h_b)+Bb_h_b)
and both Wb_h_b and Bb_h_b seem unable to learn due to the stop_gradient call.
When the stop_gradient call is removed, the baseline depends on hidden_state, which doesn't seem right, as you said ("the parameters of the baseline part are learned separately from the other two parts").
So I think baseline = tf.sigmoid(some variable independent of the model) would be more appropriate.
Please give me your opinion, thanks! (I'm also working on it.)
@jtkim-kaist
b shouldn't be updated through this term:
J = tf.concat(1, [tf.log(p_y + SMALL_NUM) * (onehot_labels_placeholder), tf.log(p_loc + SMALL_NUM) * (R - no_grad_b)])
where no_grad_b = tf.stop_gradient(b) prevents it from updating.
Instead, b is updated via this term:
J = J - tf.reduce_sum(tf.square(R - b), 1)
where the gradient of b is not stopped.
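A self-contained sketch of that two-term setup (stand-in shapes and values, names borrowed from this thread) showing that the baseline weights receive a gradient only from the squared-error term:

```python
import tensorflow as tf

R = tf.ones([1, 1])                         # stand-in reward
h = tf.ones([1, 4])                         # stand-in hidden state
Wb = tf.get_variable('Wb_h_b', [4, 1])
b = tf.sigmoid(tf.matmul(h, Wb))            # baseline
no_grad_b = tf.stop_gradient(b)

log_p_loc = tf.constant([[0.5]])            # stand-in for tf.log(p_loc + ...)
J = log_p_loc * (R - no_grad_b)             # no gradient to Wb from this term
J = J - tf.reduce_sum(tf.square(R - b), 1)  # Wb learns only from this term

print(tf.gradients(tf.reduce_sum(J), Wb))   # -> gradient from the MSE term
```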
@JasonZhao001 Thank you! I missed that part.
I'll look forward to your implementation. Have a good day!
@jtkim-kaist You are welcome :)