
Shouldn't the loss be normalized by the batch_size, not the number of words in a batch?

Open · hyhieu opened this issue 6 years ago · 1 comment

https://github.com/MultiPath/NA-NMT/blob/4054d606bf511e98aec82bc030034fa67dccc5f1/model.py#L976

Luong et al. (2017) described this implementation in TensorFlow.

In PyTorch, the default behavior is

size_average=True
ignore_index=-100
reduce=True

which means that you are averaging the loss by the number of words, not the number of sentences in a batch. Also, it looks like you are not handling ignore_index.
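For illustration, here is a minimal sketch (not the repository's code) of the difference between the two normalizations using PyTorch's cross-entropy; the shapes, `pad_idx`, and tensor names are placeholders, and the modern `reduction` argument is used in place of the older `size_average`/`reduce` flags mentioned above:

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: 2 sentences, 5 target positions each, vocabulary of 100.
batch_size, seq_len, vocab = 2, 5, 100
pad_idx = 1
logits = torch.randn(batch_size, seq_len, vocab)
targets = torch.randint(0, vocab, (batch_size, seq_len))
targets[0, -2:] = pad_idx  # pretend the first sentence is shorter (padded)

flat_logits = logits.view(-1, vocab)
flat_targets = targets.view(-1)

# Sum of per-token losses, skipping padded positions via ignore_index.
loss_sum = F.cross_entropy(flat_logits, flat_targets,
                           ignore_index=pad_idx, reduction='sum')

num_words = (flat_targets != pad_idx).sum()

loss_per_word = loss_sum / num_words       # roughly what size_average=True gives
loss_per_sentence = loss_sum / batch_size  # normalization suggested in this issue
```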

hyhieu · Mar 07 '18

Hi,

Generally, it is fine to normalize by either the batch size or the number of words. The original Transformer code seems to normalize by the number of words.

Also, since our data loader uses an adaptive batch size to keep the total number of incoming words in each batch roughly the same, it makes more sense to normalize by words.
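A rough sketch of this kind of token-count-based batching (a hypothetical helper, not the repository's actual data loader):

```python
def batch_by_tokens(examples, max_tokens=4000):
    """Group examples so each batch holds roughly max_tokens target words.

    `examples` is assumed to be an iterable of (source, target) token lists;
    this is a simplified stand-in for the actual loader.
    """
    batch, n_tokens = [], 0
    for src, tgt in examples:
        # Start a new batch once adding this example would exceed the budget.
        if batch and n_tokens + len(tgt) > max_tokens:
            yield batch
            batch, n_tokens = [], 0
        batch.append((src, tgt))
        n_tokens += len(tgt)
    if batch:
        yield batch
```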

In line 976, I did not handle ignore_index because I had already applied the mask in line 974. Thanks.
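For illustration, a minimal sketch of masking before the loss call, so that ignore_index is never needed (placeholder names and shapes, not the actual lines from model.py):

```python
import torch
import torch.nn.functional as F

# Placeholder tensors: log-probabilities and targets flattened over the batch.
pad_idx = 1
log_probs = torch.log_softmax(torch.randn(8, 100), dim=-1)
targets = torch.randint(0, 100, (8,))
targets[5:] = pad_idx  # pretend the last positions are padding

# Keep only non-padding positions before computing the loss, so padded
# tokens never contribute and ignore_index is unnecessary inside the loss.
mask = targets != pad_idx
loss = F.nll_loss(log_probs[mask], targets[mask], reduction='mean')
```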

MultiPath · Mar 08 '18