Question about training caption model
Hello, I have a question about training the caption model (`cap_model`) in the end-to-end masked transformer.
In the code, `cap_model` is trained with `window_mask = gate_scores * pred_bin_window_mask.view(B, T, 1)`. As I understand it, `pred_bin_window_mask` is derived from the model's own proposal predictions.
Does this mean the caption model (`cap_model`) is trained on the learned proposals rather than the ground-truth (labeled) proposals? Is my understanding correct?
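To make my understanding concrete, here is a minimal sketch of what I think the gated mask computation looks like, assuming the gated combination of a binary and a continuous mask from the Masked Transformer paper. Only `gate_scores` and `pred_bin_window_mask` appear in the line I quoted; the name `pred_cont_window_mask` and the helper function are my own guesses:

```python
import torch

def build_window_mask(gate_scores, pred_bin_window_mask, pred_cont_window_mask):
    """Hypothetical sketch of the gated proposal mask.

    gate_scores:           (B, T, 1) learned sigmoid gate
    pred_bin_window_mask:  (B, T)    binarized mask from the predicted proposal
    pred_cont_window_mask: (B, T, 1) continuous (differentiable) mask

    The binary term carries the predicted proposal window; the continuous
    term keeps the mask differentiable so gradients can flow back to the
    proposal module.
    """
    B, T = pred_bin_window_mask.shape
    window_mask = (gate_scores * pred_bin_window_mask.view(B, T, 1)
                   + (1.0 - gate_scores) * pred_cont_window_mask)
    return window_mask  # used to mask the encoder features fed to cap_model
```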
Also, if `cap_model` is trained on learned proposals, it could be heavily affected by the initial quality of those proposals, which seems like it would make training unstable. If I am misunderstanding anything, please point it out.
Thank you.