TSP_DRL_PtrNet
possible error in `critic.py` `forward()`?
Very helpful repo!

One question: in the `forward` function in `critic.py`, there might be an error. In line 37, the decoder always takes in the same initial `dec_input` for each city in the sequence, whereas it should actually take in the output from the previous city, as in `actor.py`, where `dec_input` is updated after processing each city.

Thanks in advance, and looking forward to your reply!
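To make the contrast concrete, here is a minimal sketch of the two patterns being discussed. All names (`decoder`, `embed_enc_inputs`, `city_t`, `dec_input`) are assumptions based on this thread, not the repo's exact code:

```python
import torch
import torch.nn as nn

batch, city_t, embed = 4, 10, 16
decoder = nn.LSTM(embed, embed, batch_first=True)

embed_enc_inputs = torch.randn(batch, city_t, embed)
dec_input = torch.zeros(batch, 1, embed)  # initial decoder input
h = torch.zeros(1, batch, embed)
c = torch.zeros(1, batch, embed)

outputs = []
for i in range(city_t):
    out, (h, c) = decoder(dec_input, (h, c))
    outputs.append(out)
    # actor.py-style: refresh dec_input each step (here, with the step's
    # own output as a stand-in for the chosen city's embedding) ...
    dec_input = out
    # ... whereas the questioned critic.py code reuses the same initial
    # dec_input on every iteration (i.e. never reassigns it).

result = torch.cat(outputs, dim=1)
print(result.shape)  # (batch, city_t, embed)
```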
Update:

Actually, I think I mixed things up. My current understanding is that for the actor, `dec_input` should be the embedding of the action sampled from the probability output of the corresponding time step, instead of the updated weighted sum of `ref` as is currently done in `actor.py`. But then I'm very confused about how this should be done in `critic.py`: should it sample separately from the actor?
I took a look at other implementations and saw how the decoder input in the critic model is usually handled. As you point out, `dec_input` in `critic.py` should be updated at every decoder time step, as in those implementations. Thank you for letting me know! I will change it later.

Some implementations initialize the critic's decoder input to a zero tensor with `torch.zeros`, while others take the encoder output as the decoder input. I'm not sure there is a single correct way to implement it. I set the decoder input to a learnable parameter. My code still works since the encoder passes its hidden state `(h, c)` to the decoder, and the decoder also passes `(h, c)` on to the next time step.
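A minimal sketch of the three initialization options mentioned above (class and attribute names are illustrative, not the repo's exact API):

```python
import torch
import torch.nn as nn

class CriticDecoderInput(nn.Module):
    def __init__(self, embed=16):
        super().__init__()
        # option 1: learnable parameter (the approach described above)
        self.dec_input = nn.Parameter(torch.randn(1, 1, embed))

    def initial_input(self, batch):
        # broadcast the single learned vector across the batch
        return self.dec_input.repeat(batch, 1, 1)

# option 2: zero tensor
#   dec_input = torch.zeros(batch, 1, embed)
# option 3: reuse (part of) the encoder output as the decoder input

m = CriticDecoderInput(embed=16)
print(m.initial_input(4).shape)  # torch.Size([4, 1, 16])
```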
Thanks for the quick reply! I've been digging into the pointer network paper and the TSP DRL paper for the past few hours, and this is what I now think:

- For the **actor**: after you pass `embed_enc_inputs` through the `encoder` and get `enc_h` and `(h, c)`, then in a `for i in range(city_t)` loop, when `i == 0`, a trainable `dec_input` should be passed as the input to the `decoder` (this is consistent with the paper); but when `i > 0`, the input for step `i` should come from step `i-1`: after you sample from `multinomial` and get a `next_node`, you use this `next_node` to look up the corresponding embedding, something like `dec_input = torch.gather(input=embed_enc_inputs, dim=1, index=next_node.unsqueeze(-1).repeat(1, 1, embed))`, and then this `dec_input` is used as the decoder input for the next step `i`.
- For the **critic**: after you pass `embed_enc_inputs` through the `encoder` and get `enc_h` and `(h, c)`, the difference is that you don't loop through the sequence, but rather over the number of process steps, which is 3 in the paper. So in the loop `for i in range(num_process_steps)`, when `i == 0`, you pass the `h` you got from the `encoder` as the input to the processer, which is also an LSTM, and perhaps initialize `(processer_h, processer_c)` to 0, something like `query = h`, then `query, (processer_h, processer_c) = processer(query, (processer_h, processer_c))`; next you put this `query` into the `glimpse` function and get an updated `query`. For `i > 0`, you just use this updated `query` as the input to the `processer` and then again `glimpse`. After 3 iterations, the final `query` is fed into a 2-layer fully connected decoder to get the value estimate.
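The critic process block described above could be sketched roughly as follows. This is my interpretation under stated assumptions, not the repo's actual code; all names (`processer`, `glimpse`, `enc_h`, the attention weights `W_ref`, `W_q`, `v`) are illustrative:

```python
import torch
import torch.nn as nn

batch, city_t, embed = 4, 10, 16
num_process_steps = 3  # 3 process steps, per the paper

processer = nn.LSTM(embed, embed, batch_first=True)
W_ref = nn.Linear(embed, embed, bias=False)
W_q = nn.Linear(embed, embed, bias=False)
v = nn.Linear(embed, 1, bias=False)

def glimpse(query, enc_h):
    # attention over encoder outputs; returns the weighted sum as new query
    # query: (batch, embed), enc_h: (batch, city_t, embed)
    u = v(torch.tanh(W_ref(enc_h) + W_q(query).unsqueeze(1))).squeeze(-1)
    a = torch.softmax(u, dim=1)                       # (batch, city_t)
    return torch.bmm(a.unsqueeze(1), enc_h).squeeze(1)

enc_h = torch.randn(batch, city_t, embed)             # encoder outputs
h = torch.randn(1, batch, embed)                      # encoder final hidden state
query = h.squeeze(0)
ph = torch.zeros(1, batch, embed)                     # processer hidden state
pc = torch.zeros(1, batch, embed)

for i in range(num_process_steps):
    _, (ph, pc) = processer(query.unsqueeze(1), (ph, pc))
    query = glimpse(ph.squeeze(0), enc_h)

# final query feeds a 2-layer fully connected head for the value estimate
value_head = nn.Sequential(nn.Linear(embed, embed), nn.ReLU(), nn.Linear(embed, 1))
value = value_head(query).squeeze(-1)
print(value.shape)  # torch.Size([4])
```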
I have no idea if this is the correct way to implement the paper, but it's my best interpretation of what is written :/ unfortunately some details are quite confusing :/ Let me know what you think; it's nice to have someone to discuss this with lol
And the reason I think the critic's loop should be over `num_process_steps` rather than `city_t` is this: if you still loop through the sequence and update `dec_input` the same way as in the actor, then the critic also has to do a `multinomial` sampling at each step. During training, that means that for the same training data, the actor and the critic might sample different action sequences, i.e. different `tour`s, which doesn't make sense, since you want the critic to predict the expected length of the tour generated by the actor. You could also feed the actions selected by the actor into the critic to accomplish this, but that makes everything overly complicated, and I don't read anything like that in the paper.
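To make the argument concrete: in an actor-critic setup for TSP, the critic only has to predict the expected tour length for a given input instance; it never needs to sample a tour itself. A schematic update step (tensor names are illustrative stand-ins, not the repo's code; real values would come from the actor rollout and the critic network):

```python
import torch

batch = 4
tour_length = torch.rand(batch) + 1.0          # lengths of tours sampled by the actor
log_prob = -torch.rand(batch)                  # sum of log-probs of the actor's choices
value = torch.rand(batch, requires_grad=True)  # critic's prediction from the same inputs

# REINFORCE with the critic as baseline: the critic never samples,
# it only scores the same inputs the actor saw
advantage = (tour_length - value).detach()
actor_loss = (advantage * log_prob).mean()
critic_loss = torch.nn.functional.mse_loss(value, tour_length)
print(actor_loss.item(), critic_loss.item())
```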
I am quite impressed by your deep understanding and really appreciate your clear explanation!

Your thoughts make a lot of sense to me: the critic model doesn't need to iterate over the number of city nodes, since, unlike the actor model, it doesn't need the masking scheme. Also, for the actor model, it sounds natural to use part of `embed_enc_inputs` as an input.
I tried some experiments changing the code around the decoder input in `critic.py`. As the DRL paper and you propose, I put an LSTM in the process block iteration before feeding the hidden state vector (= query) into the glimpse function. When I ran the code, training didn't look good, so I removed (commented out) the LSTM and used the glimpse function directly. I ran the code again and training went well.

My implementation might not exactly match what the DRL paper says, but I ended up making the training process shorter and my code closer to the paper thanks to your advice. Thank you so much.
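For reference, the glimpse-only variant described above (process steps without the LSTM) might look roughly like this; shapes and names are assumptions matching the earlier discussion, not the final repo code:

```python
import torch
import torch.nn as nn

batch, city_t, embed = 4, 10, 16
enc_h = torch.randn(batch, city_t, embed)   # encoder outputs
query = torch.randn(batch, embed)           # encoder's final hidden state

W_ref = nn.Linear(embed, embed, bias=False)
W_q = nn.Linear(embed, embed, bias=False)
v = nn.Linear(embed, 1, bias=False)

for _ in range(3):                          # process steps, no LSTM in between
    u = v(torch.tanh(W_ref(enc_h) + W_q(query).unsqueeze(1))).squeeze(-1)
    a = torch.softmax(u, dim=1)             # attention weights over cities
    query = torch.bmm(a.unsqueeze(1), enc_h).squeeze(1)

print(query.shape)  # torch.Size([4, 16])
```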
Hi, thanks a lot for the reply! Your repo has helped me a lot in understanding the technical details. I'll experiment a bit more and reach out again if any further confusion comes up. Thanks, and good luck with your research!