
Question about the gradient and pretraining

Open hhhmoan opened this issue 9 years ago • 58 comments

First, in the function calc_reward, when you compute J you use p_loc, which is built from mean_locs and sample_locs, but both mean_locs and sample_locs are wrapped in stop_gradient. So I think tf.log(p_loc + SMALL_NUM) * (R - no_grad_b) has no effect when the gradients are computed. Also, why does this need pretraining? I never found that method in the paper.
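(To make the concern concrete, here is a minimal sketch, not the repo's actual code, showing that once both tensors are wrapped in stop_gradient, the REINFORCE term carries nothing back to the location weights; all variable names are invented for the example.)

```python
import tensorflow as tf  # TF 1.x-style graph mode assumed; names are made up for illustration

h = tf.placeholder(tf.float32, [None, 256])              # stand-in for the RNN hidden state
W_loc = tf.Variable(tf.random_normal([256, 2]))           # stand-in for the location-net weights
mean_loc = tf.matmul(h, W_loc)
sample_loc = mean_loc + tf.random_normal(tf.shape(mean_loc), stddev=0.1)

# with stop_gradient on BOTH tensors, as described in the question ...
mean_loc_sg = tf.stop_gradient(mean_loc)
sample_loc_sg = tf.stop_gradient(sample_loc)
log_p_loc = -tf.reduce_sum(tf.square(sample_loc_sg - mean_loc_sg), 1)  # Gaussian log-prob up to constants
reinforce_term = tf.reduce_sum(log_p_loc)                  # (R - no_grad_b) omitted; it is a constant anyway

# ... the REINFORCE term has no gradient path back to the location weights:
print(tf.gradients(reinforce_term, [W_loc]))               # -> [None]
```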

Thanks for releasing your code. Can you resolve my doubts? Also, have you finished the experiment on the 100x100 translated cluttered MNIST data? If you have, please @me. Thanks.

hhhmoan avatar Aug 27 '16 02:08 hhhmoan

@hhhmoan Hi! Thank you very much for spending time reading the code and pointing out mysterious aspects!

The code actually works WITHOUT the pretraining. I was just a little bit curious to see if pretraining could improve anything. By the way, I just realized that the implementation of the pretraining part is incomplete (it is not training the location network)...

On the other hand, I totally understand your point about the stop_gradient on the location output. I will do some testing on that! We are still actively working on debugging. I will report our attempts to replicate the original results by Mnih et al. ASAP!

Thank you very much for your suggestion again!

qihongl avatar Aug 27 '16 14:08 qihongl

^_^ To tell the truth, I failed to implement this paper on the translated cluttered MNIST data in TensorFlow. If you succeed and release your results, it would really help me.

Best wishes for your work.

hhhmoan avatar Aug 29 '16 02:08 hhhmoan

I'm glad to hear that! We are working on it!

qihongl avatar Aug 29 '16 05:08 qihongl

Some ops in RAM are not differentiable, and in the paper the reward is used in place of the gradient so that backpropagation can still be done. But in your code the reward sits inside the loss, so it acts like a normal loss term and TF will compute gradients through the ops that are not differentiable. I see that you use tf.stop_gradient in a few places. Does that mean you simply stop the gradient of those ops? If so, I don't understand why the model works, because the parameters in the LocNet would be frozen.

Lzc6996 avatar Nov 25 '16 06:11 Lzc6996

@Lzc6996 You are right! Our StopGrad operation is incorrect and we still haven't figured out how to resolve this issue.

We failed to realize that because the network still "works". It might be the case that the other layers can adapt (by tuning their parameters) to the terrible parameters in the LocNet.

qihongl avatar Nov 29 '16 19:11 qihongl

Qihong, have you worked out how to update the parameters in LocNet yet?

JasonZhao001 avatar Jan 16 '17 15:01 JasonZhao001

Sorry, not yet. Our code is incorrect for some more fundamental reasons. I am still not exactly sure how to fix it. You can take a look at this repo: https://github.com/zhongwen/RAM

This RAM implementation beats ours on the 28x28 standard MNIST.

qihongl avatar Jan 16 '17 16:01 qihongl

Thanks for your kind and prompt reply. Do you think the implementation you mentioned above realizes the parameter updates as Mnih's paper describes?

JasonZhao001 avatar Jan 16 '17 16:01 JasonZhao001

@JasonZhao001 I am not sure. I plan to replicate Mnih's results with that implementation. I will be more certain once the replication is successful.

qihongl avatar Jan 17 '17 18:01 qihongl

@JasonZhao001 I also plan to visualize that implementation with TensorBoard.

qihongl avatar Jan 17 '17 18:01 qihongl

Yeah, I found that Zhongwen's implementation omits checkpoints and summaries (TensorBoard); it would be helpful if they were added.
Also, there is a question that still confuses me when working with your code: the "DRAW WINDOW" functions don't work on my machine, even when I set the control parameter to True. I put a "print" in the animate block and it prints during training as expected, but still no window shows up. I wonder whether the problem is only on my end, so can you tell me if it works on your platform now?

JasonZhao001 avatar Jan 18 '17 03:01 JasonZhao001

@JasonZhao001 That's strange. "draw" should work when you set draw to 1. Can you send me the error message? Thanks!

qihongl avatar Jan 18 '17 14:01 qihongl

There is no error message; it just doesn't show the window as it should. So I suspect it may be a problem with my platform.

JasonZhao001 avatar Jan 18 '17 15:01 JasonZhao001

@JasonZhao001 I see. Let me know if you get more clues about what is going on. I am more than happy to help!

qihongl avatar Jan 18 '17 19:01 qihongl

@QihongL I found the reason. Something was wrong with matplotlib on my platform; after I ran sudo pip uninstall matplotlib, it worked! It may be because I had installed two versions of matplotlib, and when I installed the second one I chose to ignore the existing one. Thanks a lot!

JasonZhao001 avatar Jan 20 '17 14:01 JasonZhao001

@JasonZhao001 Great!

qihongl avatar Jan 20 '17 18:01 qihongl

@QihongL I have found the error in your gradient implementation. The gradient should flow only via mean_loc, not from samp_loc, because samp_loc gives you the location in the input image from which the next glimpse is sampled, and hence becomes non-differentiable. But when you define the loss function:

[equation image of the loss function]

you back-propagate the gradient of the loss through the computation graph that mean_loc is part of, hence you calculate the gradient w.r.t. mean_loc. You do not calculate the gradient of the loss w.r.t. samp_loc.

EDIT: So, if you comment out the line:

mean_loc = tf.stop_gradient(mean_loc)

and keep the line:

sample_loc = tf.stop_gradient(sample_loc)

things should work. Let me know if it works for you.
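(For concreteness, a minimal sketch of the suggested setup, with invented names and an illustrative fixed sigma, not the repo's exact code: the sampled location is treated as a constant while mean_loc stays differentiable, so the REINFORCE term can update the location network.)

```python
import tensorflow as tf  # TF 1.x-style graph mode assumed; names are illustrative

h = tf.placeholder(tf.float32, [None, 256])               # RNN hidden state
R = tf.placeholder(tf.float32, [None])                     # reward
b = tf.placeholder(tf.float32, [None])                     # baseline (trained by its own MSE term)
W_loc = tf.Variable(tf.random_normal([256, 2]))
sigma = 0.22                                                # sampling std; the value is just an example

mean_loc = tf.matmul(h, W_loc)                              # NOT wrapped in stop_gradient
sample_loc = mean_loc + tf.random_normal(tf.shape(mean_loc), stddev=sigma)
sample_loc = tf.stop_gradient(sample_loc)                   # the sample is a constant w.r.t. the graph

# Gaussian log-likelihood of the sample under the current policy;
# d(log_p_loc)/d(mean_loc) = (sample_loc - mean_loc) / sigma**2, which is what trains W_loc.
log_p_loc = -tf.reduce_sum(tf.square(sample_loc - mean_loc), 1) / (2.0 * sigma ** 2)
reinforce_loss = -tf.reduce_mean(log_p_loc * tf.stop_gradient(R - b))   # minimize this
```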

Hippogriff avatar Feb 03 '17 16:02 Hippogriff

@GodOfProbability Hi! Thank you so much for pointing this out!

I also think this is causing the trouble. I think I tried commenting that line out before and it didn't work. I guess if I don't stop_grad it, then the gradients flow over time (across different glimpses).

qihongl avatar Feb 05 '17 19:02 qihongl

@QihongL I did an experiment on a toy example, and stopping only the gradient from sample_loc improves performance, not the other way around.

I guess if I don't stop_grad it, then the gradients flow over time (across different glimpses).

I think it will not propagate the gradient over time, because only sample_loc interacts with the next time step, not mean_loc. Hence stopping the gradient at sample_loc is sufficient to stop the "bad" gradient from flowing across time. Furthermore, keep in mind that the mean_loc values are different objects predicted at every time step, and you start the back-propagation by taking the gradient w.r.t. mean_loc, so nothing comes from the non-differentiable part.

Hippogriff avatar Feb 06 '17 18:02 Hippogriff

@GodOfProbability That's very interesting... I will try that! Thank you very much for the suggestion! I will let you know what I find out!

qihongl avatar Feb 07 '17 05:02 qihongl

@GodOfProbability @QihongL Suppose we don't sample but use mean_loc directly, similar to the soft attention in "Show, Attend and Tell". Then the questions are:

  1. Do you think it will work well?
  2. Of course, the gradient of mean_loc will flow across time in this case. Would this kind of gradient be "bad", as you said, i.e. the "bad" gradient flowing across time (across different glimpses)?

JasonZhao001 avatar Feb 27 '17 17:02 JasonZhao001

@JasonZhao001 If you stop the gradient at sample_loc, the bad gradients will not flow, because only sample_loc interacts across time; by stopping the gradient flowing from sample_loc, you are actually stopping the across-time gradient from flowing through mean_loc. However, there are gradients coming from the loss term that corresponds to the reward function, and those should flow through mean_loc: this is the gradient from the term involving sample_loc, and it comes from differentiating the Monte Carlo approximation of the gradient of the reward function. If time permits, you should run the experiments and let us know.

Hippogriff avatar Feb 28 '17 00:02 Hippogriff

@GodOfProbability Yes, you are right! The parameters of the location-generation module are updated via the derivative of log[P(sample_loc | mean_loc, sigma)] w.r.t. parameters_loc, which, by the chain rule, goes through the derivative of mean_loc w.r.t. parameters_loc. I will do experiments on this later and will try that assumption as well, then report my results.
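(Written out, under the assumption of an isotropic Gaussian policy with a fixed sigma, the chain rule referred to here is:)

```latex
\frac{\partial}{\partial \theta_{\mathrm{loc}}}
  \log \mathcal{N}\!\left(\mathrm{sample\_loc} \mid \mathrm{mean\_loc}, \sigma^{2}\right)
  = \frac{\mathrm{sample\_loc} - \mathrm{mean\_loc}}{\sigma^{2}}
    \cdot \frac{\partial\, \mathrm{mean\_loc}}{\partial \theta_{\mathrm{loc}}}
```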

JasonZhao001 avatar Feb 28 '17 01:02 JasonZhao001

@QihongL @GodOfProbability @jlindsey15 @Lzc6996 It works well when I comment out the line mean_loc = tf.stop_gradient(mean_loc), as Gopal described above. With bandwidth = 12 it converges to more than 96% accuracy at 600k time steps (early stop), and I'm fairly sure it can reach Mnih's result with some tuning of the training parameters. By the way, if you do stop_gradient at mean_loc, TensorBoard shows that the parameters there never update during training.

I also have a possible explanation of why that implementation still works: the attention window with three scales, especially the 2nd and 3rd ones, covers enough information (48 vs. 60), and the bandwidth is large enough to recognize the slightly lower-resolution content of the 2nd and 3rd scales (1/2 and 1/4 resolution, respectively). So the result relies on the fully connected layers for classification, and it is effectively the same as recognition from a fixed-location glimpse. You can try using only two scales (e.g. 12 and 24) or making the bandwidth smaller (e.g. 8 or 6); then it will not work as well. The same applies to the work in https://github.com/zhongwen/RAM

Moreover, once the problem is fixed, a smaller bandwidth (e.g. 8) performs better, with higher accuracy and faster convergence. In my experiment it converged to 97% at 400k time steps (so I made an early stop). Furthermore, I found that this original implementation does not apply the M-sample duplication; I plan to try it, and any suggestions from you would help a lot. If I succeed, I will share the source code as well. Thanks!

JasonZhao001 avatar Mar 03 '17 03:03 JasonZhao001

@JasonZhao001 @GodOfProbability Hi Jason, I read your comment with great interest. I also fixed the stop_gradient problem, but I cannot achieve performance as high as yours (about 94% in my case, translated setting). Could you share your hyperparameters and learning strategies? Moreover, apart from the mean_loc issue, do you think the baseline in this code is correctly implemented? I think the baseline should also be learnable. Please give me your opinion, thanks!

jtkim-kaist avatar Mar 03 '17 04:03 jtkim-kaist

@jtkim-kaist The baseline technique is very important for location prediction. It is learnable through an extra term in the cost function, as shown in the source code below:

    J = J - tf.reduce_sum(tf.square(R - b), 1)

Note that the parameters of the baseline part are learnt separately from the other two parts. Of course, I have modified some of the hyperparameters to make it work better, but I know they are not the best yet and I'm still trying. If I succeed, I will post my implementation later.

JasonZhao001 avatar Mar 03 '17 07:03 JasonZhao001

@JasonZhao001 Thank you for your kind comment.

I agree with you. However, in this code the baseline is implemented as below:

baseline = tf.sigmoid(tf.matmul(hiddensState, Wb_h_b) + Bb_h_b)

and it seems that both Wb_h_b and Bb_h_b cannot learn because of the stop_gradient function.

When the stop_gradient function is turned off, the baseline depends on hidden_state, which does not seem right given what you said ("the parameters of the baseline part are learnt separately from the other two parts").

So I think "baseline = tf.sigmoid(some variable independent of the model)" would be more appropriate.

Please give me your opinion, thanks! (I'm also working on it.)

jtkim-kaist avatar Mar 03 '17 08:03 jtkim-kaist

@jtkim-kaist b shouldn't be updated from this term:

    J = tf.concat(1, [tf.log(p_y + SMALL_NUM) * (onehot_labels_placeholder), tf.log(p_loc + SMALL_NUM) * (R - no_grad_b)])

where no_grad_b = tf.stop_gradient(b) prevents it from updating.

Meanwhile, b is updated via this term:

    J = J - tf.reduce_sum(tf.square(R - b), 1)

where the gradient through b is not stopped.
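(Putting the two terms together, a minimal sketch with invented names, not the repo's exact code: the stopped copy of b only centers the reward, while the unstopped copy receives the squared-error gradient that trains the baseline weights.)

```python
import tensorflow as tf  # TF 1.x-style graph mode assumed; names are illustrative

hidden = tf.placeholder(tf.float32, [None, 256])
R = tf.placeholder(tf.float32, [None, 1])                  # reward per example
log_p_loc = tf.placeholder(tf.float32, [None, 1])          # stands in for tf.log(p_loc + SMALL_NUM)

Wb = tf.Variable(tf.random_normal([256, 1]))
Bb = tf.Variable(tf.zeros([1]))
b = tf.sigmoid(tf.matmul(hidden, Wb) + Bb)                 # baseline prediction
no_grad_b = tf.stop_gradient(b)                             # constant copy, used only to center R

reinforce_term = tf.reduce_sum(log_p_loc * (R - no_grad_b), 1)   # no gradient into Wb, Bb
baseline_term  = -tf.reduce_sum(tf.square(R - b), 1)              # regresses b toward R, trains Wb, Bb
J = tf.reduce_mean(reinforce_term + baseline_term)                # objective to maximize
```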

JasonZhao001 avatar Mar 03 '17 08:03 JasonZhao001

@JasonZhao001 Thank you! I missed that part.

I look forward to your implementation. Have a good day!

jtkim-kaist avatar Mar 03 '17 09:03 jtkim-kaist

@jtkim-kaist You are welcome :)

JasonZhao001 avatar Mar 03 '17 09:03 JasonZhao001