TSP_DRL_PtrNet
possible error in `critic.py` `forward()`?
Very helpful repo!

One question: in the `forward` function in `critic.py`, there might be an error. In line 37, the decoder always takes in the same initial `dec_input` for each city in the sequence, whereas it should actually take in the output from the previous city, as in `actor.py`, where `dec_input` is updated after processing each city.

Thanks in advance, and looking forward to your reply!
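To make the contrast concrete, here is a minimal sketch of the two patterns being discussed. All names (`decoder`, `embed_enc_inputs`, `city_t`, `dec_input`) are assumptions based on this thread, not the repo's exact code:

```python
import torch
import torch.nn as nn

batch, city_t, embed = 4, 10, 16
decoder = nn.LSTM(embed, embed, batch_first=True)

embed_enc_inputs = torch.randn(batch, city_t, embed)
dec_input = torch.zeros(batch, 1, embed)  # initial decoder input
h = torch.zeros(1, batch, embed)
c = torch.zeros(1, batch, embed)

outputs = []
for i in range(city_t):
    out, (h, c) = decoder(dec_input, (h, c))
    outputs.append(out)
    # actor.py-style: refresh dec_input each step (here, with the step's
    # own output as a stand-in for the chosen city's embedding) ...
    dec_input = out
    # ... whereas the questioned critic.py code reuses the same initial
    # dec_input on every iteration (i.e. never reassigns it).

result = torch.cat(outputs, dim=1)
print(result.shape)  # (batch, city_t, embed)
```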
Update:

Actually, I think I mixed things up. My current understanding is that for the actor, `dec_input` should be the embedding of the action sampled from the probability output of the corresponding time step, instead of the updated weighted sum of `ref` as is currently done in `actor.py`. But then I'm very confused about how this should be done in `critic.py`: should it sample separately from the actor?
I took a look at other implementations and saw how the decoder input in the critic model is usually handled. As you point out, `dec_input` in `critic.py` should be updated at every decoder time step, as in those implementations. Thank you for letting me know! I will change it later.

Some implementations initialize the critic's decoder input to a zero tensor with `torch.zeros`, while others take the encoder output as the decoder input. I'm not sure there is a single correct way to implement it. I set the decoder input to a learnable parameter. My code still works since the encoder passes its hidden state `(h, c)` to the decoder, and the decoder also passes `(h, c)` on to the next time step.
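A minimal sketch of the three initialization options mentioned above (class and attribute names are illustrative, not the repo's exact API):

```python
import torch
import torch.nn as nn

class CriticDecoderInput(nn.Module):
    def __init__(self, embed=16):
        super().__init__()
        # option 1: learnable parameter (the approach described above)
        self.dec_input = nn.Parameter(torch.randn(1, 1, embed))

    def initial_input(self, batch):
        # broadcast the single learned vector across the batch
        return self.dec_input.repeat(batch, 1, 1)

# option 2: zero tensor
#   dec_input = torch.zeros(batch, 1, embed)
# option 3: reuse (part of) the encoder output as the decoder input

m = CriticDecoderInput(embed=16)
print(m.initial_input(4).shape)  # torch.Size([4, 1, 16])
```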
Thanks for the quick reply! I've been digging into the pointer network paper and the TSP DRL paper for the past few hours, and this is what I now think:

- For the **actor**: after you pass `embed_enc_inputs` through the `encoder` and get `enc_h` and `(h, c)`, then in a `for i in range(city_t)` loop, when `i == 0`, a trainable `dec_input` should be passed as the input to the `decoder` (this is consistent with the paper); but when `i > 0`, the input for step `i` should come from step `i-1`: after you sample from `multinomial` and get a `next_node`, you use this `next_node` to look up the corresponding embedding, something like `dec_input = torch.gather(input=embed_enc_inputs, dim=1, index=next_node.unsqueeze(-1).repeat(1, 1, embed))`, and then this `dec_input` is used as the decoder input for the next step `i`.
- For the **critic**: after you pass `embed_enc_inputs` through the `encoder` and get `enc_h` and `(h, c)`, the difference is that you don't loop through the sequence, but rather over the number of process steps, which is 3 in the paper. So in the loop `for i in range(num_process_steps)`, when `i == 0`, you pass the `h` you got from the `encoder` as the input to the processer, which is also an LSTM, and perhaps initialize `(processer_h, processer_c)` to 0, something like `query = h`, then `query, (processer_h, processer_c) = processer(query, (processer_h, processer_c))`; next you put this `query` into the `glimpse` function and get an updated `query`. For `i > 0`, you just use this updated `query` as the input to the `processer` and then again `glimpse`. After 3 iterations, the final `query` is fed into a 2-layer fully connected decoder to get the value estimate.
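The critic process block described above could be sketched roughly as follows. This is my interpretation under stated assumptions, not the repo's actual code; all names (`processer`, `glimpse`, `enc_h`, the attention weights `W_ref`, `W_q`, `v`) are illustrative:

```python
import torch
import torch.nn as nn

batch, city_t, embed = 4, 10, 16
num_process_steps = 3  # 3 process steps, per the paper

processer = nn.LSTM(embed, embed, batch_first=True)
W_ref = nn.Linear(embed, embed, bias=False)
W_q = nn.Linear(embed, embed, bias=False)
v = nn.Linear(embed, 1, bias=False)

def glimpse(query, enc_h):
    # attention over encoder outputs; returns the weighted sum as new query
    # query: (batch, embed), enc_h: (batch, city_t, embed)
    u = v(torch.tanh(W_ref(enc_h) + W_q(query).unsqueeze(1))).squeeze(-1)
    a = torch.softmax(u, dim=1)                       # (batch, city_t)
    return torch.bmm(a.unsqueeze(1), enc_h).squeeze(1)

enc_h = torch.randn(batch, city_t, embed)             # encoder outputs
h = torch.randn(1, batch, embed)                      # encoder final hidden state
query = h.squeeze(0)
ph = torch.zeros(1, batch, embed)                     # processer hidden state
pc = torch.zeros(1, batch, embed)

for i in range(num_process_steps):
    _, (ph, pc) = processer(query.unsqueeze(1), (ph, pc))
    query = glimpse(ph.squeeze(0), enc_h)

# final query feeds a 2-layer fully connected head for the value estimate
value_head = nn.Sequential(nn.Linear(embed, embed), nn.ReLU(), nn.Linear(embed, 1))
value = value_head(query).squeeze(-1)
print(value.shape)  # torch.Size([4])
```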
I have no idea if this is the correct way to implement the paper, but it's my best interpretation of what is written :/ unfortunately some details are quite confusing :/ Let me know what you think; it's nice to have someone to discuss this with lol
And the reason I think the critic's loop should be over `num_process_steps` rather than `city_t` is this: if you still loop through the sequence and update `dec_input` the same way as in the actor, then the critic also has to do a `multinomial` sampling at each step. During training, that means that for the same training data, the actor and the critic might sample different action sequences, i.e. different `tour`s, which doesn't make sense, since you want the critic to predict the expected length of the tour generated by the actor. You could also feed the actions selected by the actor into the critic to accomplish this, but that makes everything overly complicated, and I don't read anything like that in the paper.
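To make the argument concrete: in an actor-critic setup for TSP, the critic only has to predict the expected tour length for a given input instance; it never needs to sample a tour itself. A schematic update step (tensor names are illustrative stand-ins, not the repo's code; real values would come from the actor rollout and the critic network):

```python
import torch

batch = 4
tour_length = torch.rand(batch) + 1.0          # lengths of tours sampled by the actor
log_prob = -torch.rand(batch)                  # sum of log-probs of the actor's choices
value = torch.rand(batch, requires_grad=True)  # critic's prediction from the same inputs

# REINFORCE with the critic as baseline: the critic never samples,
# it only scores the same inputs the actor saw
advantage = (tour_length - value).detach()
actor_loss = (advantage * log_prob).mean()
critic_loss = torch.nn.functional.mse_loss(value, tour_length)
print(actor_loss.item(), critic_loss.item())
```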
I am quite impressed by your deep understanding and really appreciate your clear explanation!

Your thoughts make a lot of sense to me: the critic model doesn't need to iterate over the number of city nodes, since, unlike the actor model, it doesn't need the masking scheme. Also, for the actor model, it sounds natural to use part of `embed_enc_inputs` as an input.
I tried some experiments changing the code around the decoder input in `critic.py`. As the DRL paper and you propose, I put an LSTM in the process block iteration before feeding the hidden state vector (= query) into the glimpse function. When I ran the code, training didn't look good, so I removed (commented out) the LSTM and used the glimpse function directly. I ran the code again and training went well.

My implementation might not exactly match what the DRL paper says, but I ended up making the training process shorter and my code closer to the paper thanks to your advice. Thank you so much.
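For reference, the glimpse-only variant described above (process steps without the LSTM) might look roughly like this; shapes and names are assumptions matching the earlier discussion, not the final repo code:

```python
import torch
import torch.nn as nn

batch, city_t, embed = 4, 10, 16
enc_h = torch.randn(batch, city_t, embed)   # encoder outputs
query = torch.randn(batch, embed)           # encoder's final hidden state

W_ref = nn.Linear(embed, embed, bias=False)
W_q = nn.Linear(embed, embed, bias=False)
v = nn.Linear(embed, 1, bias=False)

for _ in range(3):                          # process steps, no LSTM in between
    u = v(torch.tanh(W_ref(enc_h) + W_q(query).unsqueeze(1))).squeeze(-1)
    a = torch.softmax(u, dim=1)             # attention weights over cities
    query = torch.bmm(a.unsqueeze(1), enc_h).squeeze(1)

print(query.shape)  # torch.Size([4, 16])
```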
Hi, thanks a lot for the reply! Your repo has helped me a lot in understanding the technical details. I'll experiment a bit more and reach out again if any further confusion comes up. Thanks, and good luck with your research!