a-PyTorch-Tutorial-to-Image-Captioning
What is the purpose of batch_size_t in model.py?
I am not clear about this code in model.py:
for t in range(max(decode_lengths)):
    batch_size_t = sum([l > t for l in decode_lengths])
    attention_weighted_encoding, alpha = self.attention(encoder_out[:batch_size_t],
                                                        h[:batch_size_t])
    gate = self.sigmoid(self.f_beta(h[:batch_size_t]))  # gating scalar, (batch_size_t, encoder_dim)
    attention_weighted_encoding = gate * attention_weighted_encoding
    h, c = self.decode_step(
        torch.cat([embeddings[:batch_size_t, t, :], attention_weighted_encoding], dim=1),
        (h[:batch_size_t], c[:batch_size_t]))  # (batch_size_t, decoder_dim)
    preds = self.fc(self.dropout(h))  # (batch_size_t, vocab_size)
    predictions[:batch_size_t, t, :] = preds
    alphas[:batch_size_t, t, :] = alpha
especially the line batch_size_t = sum([l > t for l in decode_lengths]). Please explain why we need batch_size_t.
I am also confused. I think there is a lot of repetitive work when "t" is very small. Are you clear on this now? Waiting for discussion.
In the "forward" function starting from line 180, the captions are sorted in decreasing length, i.e. from longest to shortest. So batch_size_t = sum([l > t for l in decode_lengths]) checks for captions longer than 't', so that only those images in the minibatch, if you like, can be decoded. The author explained it and used a diagram to explain the reasoning in the ReadMe of the repo.
It's important to note that all captions are padded to the same length; however, the final argument of the forward function, "caption_lengths", holds the true length of each caption, i.e. excluding the 'pad' elements. As a result, the author did not want to spend computation on the 'pad' elements when training the model. If that makes no difference to you, or all the true lengths of your captions are the same, you could remove batch_size_t = sum([l > t for l in decode_lengths]) entirely, substitute ':' for 'batch_size_t' everywhere, and simply run the for-loop up to the padded caption length.
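For completeness, a rough sketch of that simplified variant (this is not the author's code; it assumes the surrounding tensors encoder_out, embeddings, h, c, predictions, alphas and the layers are exactly those of the tutorial's DecoderWithAttention.forward, and that predictions and alphas are allocated for the full padded length):

# Simplified decoding loop that ignores true caption lengths and decodes
# every timestep up to the padded length for the whole batch.
max_len = embeddings.size(1)  # padded caption length

for t in range(max_len):
    attention_weighted_encoding, alpha = self.attention(encoder_out, h)
    gate = self.sigmoid(self.f_beta(h))                      # (batch_size, encoder_dim)
    attention_weighted_encoding = gate * attention_weighted_encoding
    h, c = self.decode_step(
        torch.cat([embeddings[:, t, :], attention_weighted_encoding], dim=1),
        (h, c))                                               # (batch_size, decoder_dim)
    preds = self.fc(self.dropout(h))                          # (batch_size, vocab_size)
    predictions[:, t, :] = preds
    alphas[:, t, :] = alpha

Note that even with this simplification the loss should still ignore pad positions, e.g. by masking or by packing the scores and targets with pack_padded_sequence; otherwise the model is also trained to predict the pad tokens.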