a-PyTorch-Tutorial-to-Transformers
grouping according to similar lengths
Hi @sgrvinod, thank you for your tutorial on Attention Is All You Need. I have a small question and would appreciate an answer.
In dataloader.py you've grouped the batches by length, so that each batch contains sequences of similar lengths. Is that necessary? I understand that it speeds up training and reduces memory, but does it have any effect on performance if I don't group the data by length?
Thanks
Hello @homelifes
I also received your email but I will answer here in case it is helpful for others.
I think you'd definitely need to batch sequences such that each batch contains the same number of tokens, because the predictions are at the token level, and therefore losses are averaged over tokens. A batch with fewer tokens will grant a disproportionate amount of influence to those tokens.
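As a rough illustration of what "the same number of tokens per batch" means in practice, here is a minimal sketch (not the tutorial's actual dataloader) that greedily fills batches against a fixed token budget; the function name and the `max_tokens` value are just placeholders:

```python
from typing import List


def batch_by_token_budget(sequences: List[List[int]], max_tokens: int = 4000):
    """Greedily fill each batch until adding another sequence would exceed
    the token budget, so every batch contributes a similar number of tokens
    to the averaged loss."""
    batches, current, current_tokens = [], [], 0
    for seq in sequences:
        if current and current_tokens + len(seq) > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(seq)
        current_tokens += len(seq)
    if current:
        batches.append(current)
    return batches
```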
About whether you need to group sequences of the same or similar lengths in a batch - I don't think you need to do this.
Grouping sequences by length is mainly for two reasons:
- There are fewer (or no) pads to compute over.
- For a fixed number of tokens in the batch, a batch tensor containing sequences of varying lengths takes much more memory than one containing sequences of similar length. For example, if most sequences in a batch have an average length of 50 but one sequence has length 95, every other sequence must be padded to length 95, making the batch tensor nearly twice as large.
So, if this is not a concern, then I think you'd be fine having sequences of varying lengths in your batches, as long as approximately the same number of tokens is being predicted in each batch. In fact, it might even be better, because your batching can then be much more random each epoch, whereas trying to ensure source and target sequences are of similar lengths reduces the variation in batch contents. (That is, trying to minimize padding in batches like I did may produce the same batches every time, with only the order of the batches shuffled between epochs.)
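For contrast, a minimal sketch of length-based grouping along the lines described above (again just an illustration, not the code from dataloader.py): sort by length, slice into fixed-size chunks, and shuffle only the order of the chunks, which is why the batch contents end up the same every epoch:

```python
import random


def batch_by_length(sequences, batch_size=64):
    """Group similar-length sequences together to minimize padding.
    Batch contents are fixed; only the batch order changes per epoch."""
    ordered = sorted(sequences, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)  # shuffle batch order, not batch contents
    return batches
```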
Thanks for your answer, @sgrvinod.
> I think you'd definitely need to batch sequences such that each batch contains the same number of tokens, because the predictions are at the token level, and therefore losses are averaged over tokens. A batch with fewer tokens will grant a disproportionate amount of influence to those tokens.
But according to your code, you aren't computing the loss over the pads, so the loss is calculated only for the non-padded tokens, as written here:
```python
inputs, _, _, _ = pack_padded_sequence(input=inputs,
                                        lengths=lengths,
                                        batch_first=True,
                                        enforce_sorted=False)  # (sum(lengths), vocab_size)
targets, _, _, _ = pack_padded_sequence(input=targets,
                                         lengths=lengths,
                                         batch_first=True,
                                         enforce_sorted=False)  # (sum(lengths))
```
So why would there be an imbalance?
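For reference, here is a minimal self-contained sketch (toy shapes, not the tutorial's exact training loop) showing that packing keeps only the first `length` positions of each sequence, so the cross-entropy average is taken over real tokens only:

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence

# Toy batch: logits over a vocabulary of 10 for two sequences padded to length 5.
logits = torch.randn(2, 5, 10)          # (batch_size, max_len, vocab_size)
targets = torch.randint(0, 10, (2, 5))  # (batch_size, max_len), includes pad positions
lengths = torch.tensor([5, 3])          # true lengths; the second sequence has 2 pads

# Packing drops the padded positions entirely.
packed_logits = pack_padded_sequence(logits, lengths, batch_first=True,
                                     enforce_sorted=False).data   # (sum(lengths), vocab_size)
packed_targets = pack_padded_sequence(targets, lengths, batch_first=True,
                                      enforce_sorted=False).data  # (sum(lengths))

# The mean is taken over the 8 real tokens; pads never enter the loss.
loss = F.cross_entropy(packed_logits, packed_targets)
```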