Why does DecoderWithAttention still need encoded captions in the forward pass during validation?
Thanks for your work first! I learned a lot.
In the forward function of DecoderWithAttention, I see that at each output step the LSTMCell is fed an embedded token from encoded_captions, which is a supervised (teacher-forcing) input. This is understandable in training, but during validation, shouldn't the ground-truth embedded caption token be replaced by the word predicted at the previous step?
I don't know where I'm going wrong.
h, c = self.decode_step(
    torch.cat([embeddings[:batch_size_t, t, :], attention_weighted_encoding], dim=1),
    (h[:batch_size_t], c[:batch_size_t]))  # (batch_size_t, decoder_dim)
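If I understand correctly, the loop around that call is roughly doing the following (a simplified sketch with hypothetical sizes; I've left attention_weighted_encoding fixed, even though the real model recomputes it with attention at every step). The point is that embeddings always come from the ground-truth encoded_captions, i.e. teacher forcing:

import torch
import torch.nn as nn

# Hypothetical sizes, just for illustration
vocab_size, embed_dim, encoder_dim, decoder_dim = 1000, 512, 2048, 512

embedding = nn.Embedding(vocab_size, embed_dim)
decode_step = nn.LSTMCell(embed_dim + encoder_dim, decoder_dim)

def teacher_forced_unroll(encoded_captions, attention_weighted_encoding, h, c):
    # Every step reads the ground-truth token from encoded_captions,
    # regardless of what the model predicted at the previous step.
    embeddings = embedding(encoded_captions)  # (batch, max_len, embed_dim)
    hidden_states = []
    for t in range(encoded_captions.size(1)):
        h, c = decode_step(
            torch.cat([embeddings[:, t, :], attention_weighted_encoding], dim=1),
            (h, c))
        hidden_states.append(h)
    return torch.stack(hidden_states, dim=1)  # (batch, max_len, decoder_dim)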
I don't know if it is too late to comment...
I guess this teacher-forced forward pass is only used for evaluation during training. If you check their inference code, caption.py, the decoder is unrolled step by step using the embedding of the word the model itself generated at the previous step instead.
The validation score is only there to be compared with the training-time score and to select the best model checkpoint.
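Roughly, inference looks like the sketch below. This is a simplified greedy version rather than the beam search caption.py actually uses, and the attribute names (init_hidden_state, embedding, attention, f_beta, sigmoid, decode_step, fc) are assumed to match the tutorial's DecoderWithAttention. The key difference from the forward function: prev_word at step t+1 is the model's own prediction from step t, not a ground-truth token.

import torch

def greedy_decode(decoder, encoder_out, start_token, end_token, max_len=50):
    # encoder_out: (1, num_pixels, encoder_dim); batch size 1 assumed
    h, c = decoder.init_hidden_state(encoder_out)
    prev_word = torch.tensor([start_token])              # start with <start>
    seq = [start_token]
    for _ in range(max_len):
        emb = decoder.embedding(prev_word)                # (1, embed_dim)
        awe, _ = decoder.attention(encoder_out, h)        # (1, encoder_dim)
        awe = decoder.sigmoid(decoder.f_beta(h)) * awe    # gating, as in the tutorial
        h, c = decoder.decode_step(torch.cat([emb, awe], dim=1), (h, c))
        scores = decoder.fc(h)                            # (1, vocab_size)
        prev_word = scores.argmax(dim=1)                  # feed back the model's own prediction
        seq.append(prev_word.item())
        if prev_word.item() == end_token:
            break
    return seq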