a-PyTorch-Tutorial-to-Image-Captioning

What is the purpose of batch_size_t in model.py?

Open anesh-ml opened this issue 5 years ago • 2 comments

I am not clear on this code in model.py:

for t in range(max(decode_lengths)):
    batch_size_t = sum([l > t for l in decode_lengths])
    attention_weighted_encoding, alpha = self.attention(encoder_out[:batch_size_t],
                                                        h[:batch_size_t])
    gate = self.sigmoid(self.f_beta(h[:batch_size_t]))  # gating scalar, (batch_size_t, encoder_dim)
    attention_weighted_encoding = gate * attention_weighted_encoding
    h, c = self.decode_step(
        torch.cat([embeddings[:batch_size_t, t, :], attention_weighted_encoding], dim=1),
        (h[:batch_size_t], c[:batch_size_t]))  # (batch_size_t, decoder_dim)
    preds = self.fc(self.dropout(h))  # (batch_size_t, vocab_size)
    predictions[:batch_size_t, t, :] = preds
    alphas[:batch_size_t, t, :] = alpha

In particular, batch_size_t = sum([l > t for l in decode_lengths]). Please explain why we need batch_size_t.

anesh-ml • Sep 24 '20 10:09

I am also confused. I think there is a lot of repetitious work when "t" is very small. Are you clear now? Waiting for discussion.

YPatrickW • Oct 09 '20 13:10

In the "forward" function starting from line 180, the captions are sorted in decreasing length, i.e. from longest to shortest. So batch_size_t = sum([l > t for l in decode_lengths]) checks for captions longer than 't', so that only those images in the minibatch, if you like, can be decoded. The author explained it and used a diagram to explain the reasoning in the ReadMe of the repo. It's important to note that all captions are padded to the same length, however the final argument of the forward function "caption_lengths" holds the true length of each caption, i.e. excluding the 'pad' elements. As a result, the author didn't want to use the 'pad' elements in training the model. If however it makes no difference to you or all the true lengths of your captions are the same, you could easily exclude the 'batch_size_t' variable and substitute it with ':' everywhere and remove batch_size_t = sum([l > t for l in decode_lengths]) completely and simply run the for-loop till the length of the captions

abbaahmad • Jan 10 '21 15:01