NA-NMT
Shouldn't the loss be normalized by the batch_size, not the number of words in a batch?
https://github.com/MultiPath/NA-NMT/blob/4054d606bf511e98aec82bc030034fa67dccc5f1/model.py#L976
Luong et al. (2017) described this implementation in TensorFlow.
In PyTorch, the default behavior is `size_average=True`, `ignore_index=-100`, `reduce=True`, which means that you are averaging the loss over the number of words, not the number of sentences in a batch. Also, it looks like you are not handling `ignore_index`.
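For concreteness, here is a minimal sketch of the two normalization choices using the current PyTorch API (the shapes and the `PAD` index are made up for illustration; `reduction="mean"` corresponds to the old `size_average=True, reduce=True` defaults):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes and PAD index, not the repository's actual values.
PAD = 1
batch_size, seq_len, vocab = 4, 7, 100
logits = torch.randn(batch_size, seq_len, vocab)          # model outputs
targets = torch.randint(2, vocab, (batch_size, seq_len))  # reference tokens
targets[:, -2:] = PAD                                      # pretend padding

# Per-word normalization: "mean" divides the summed loss by the number of
# non-ignored target tokens (the PyTorch default behavior described above).
loss_per_word = F.cross_entropy(
    logits.view(-1, vocab), targets.view(-1),
    ignore_index=PAD, reduction="mean")

# Per-sentence normalization: sum the token losses, then divide by batch size.
loss_per_sentence = F.cross_entropy(
    logits.view(-1, vocab), targets.view(-1),
    ignore_index=PAD, reduction="sum") / batch_size
```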
Hi,
Generally, it is fine to normalize either by the batch size or by the number of words. The original Transformer code seems to normalize by words.
Also, since our data loader uses an adaptive batch size to keep the total number of incoming words per batch roughly the same, it makes more sense to normalize by words.
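A rough sketch of what such token-based batching can look like (a hypothetical helper for illustration, not the repository's actual data loader): sentences are accumulated until a word budget is reached, so every batch carries roughly the same number of words.

```python
def token_batches(sentences, max_tokens=3000):
    """Group sentences so each batch holds roughly `max_tokens` words.
    Hypothetical illustration, not the NA-NMT data loader."""
    batch, batch_tokens = [], 0
    for sent in sorted(sentences, key=len):
        if batch and batch_tokens + len(sent) > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(sent)
        batch_tokens += len(sent)
    if batch:
        yield batch
```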
In line 976, I did not handle `ignore_index` because I have already applied the mask in line 974. Thanks!
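For reference, masking by hand is equivalent to using `ignore_index`: zeroing out the padded positions before the reduction removes them from the loss. A minimal sketch (again with made-up shapes and `PAD` index, not the code at lines 974-976):

```python
import torch
import torch.nn.functional as F

PAD = 1
logits = torch.randn(4, 7, 100)
targets = torch.randint(2, 100, (4, 7))
targets[:, -2:] = PAD                      # pretend padding

# Unreduced token losses, then an explicit mask over the padded positions.
token_loss = F.cross_entropy(
    logits.view(-1, 100), targets.view(-1), reduction="none")
mask = (targets.view(-1) != PAD).float()
loss = (token_loss * mask).sum() / mask.sum()   # average over real words only
```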