OpenNMT-py
Add bleu
Added BLEU from AllenNLP instead of what @vince62s suggested, since there the tensors can be used directly and it works with validation minibatches. With the corpus_bleu implementation @vince62s referenced, the entire validation set would be required at once. This one also already excludes the pad tokens.
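The point about minibatches can be sketched as follows: instead of needing all hypotheses at once (as corpus_bleu does), you keep running n-gram match/total counts and sum them per batch, so the final score still equals corpus-level BLEU. This is an illustrative accumulator, not the AllenNLP implementation; all names here are hypothetical.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    # All contiguous n-grams of a token sequence, with multiplicity.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

class BleuAccumulator:
    """Toy corpus-BLEU accumulator: per-batch statistics are summed, so
    the whole validation set never needs to be held in memory."""

    def __init__(self, max_n=4):
        self.max_n = max_n
        self.matches = [0] * max_n   # clipped n-gram matches per order
        self.totals = [0] * max_n    # hypothesis n-gram counts per order
        self.hyp_len = 0
        self.ref_len = 0

    def update(self, hypothesis, reference):
        # Pad tokens should already be stripped from both sequences.
        self.hyp_len += len(hypothesis)
        self.ref_len += len(reference)
        for n in range(1, self.max_n + 1):
            hyp_counts = ngrams(hypothesis, n)
            ref_counts = ngrams(reference, n)
            # Count clipping: a hypothesis n-gram matches at most as many
            # times as it appears in the reference.
            self.matches[n - 1] += sum(min(c, ref_counts[g])
                                       for g, c in hyp_counts.items())
            self.totals[n - 1] += sum(hyp_counts.values())

    def score(self):
        if min(self.totals) == 0 or min(self.matches) == 0:
            return 0.0
        log_prec = sum(math.log(m / t)
                       for m, t in zip(self.matches, self.totals)) / self.max_n
        # Brevity penalty.
        bp = min(1.0, math.exp(1 - self.ref_len / self.hyp_len))
        return 100.0 * bp * math.exp(log_prec)
```

Calling `update` once per sentence (or per batch) and `score` only at report time gives the same result regardless of how the validation set is split into batches.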
https://github.com/OpenNMT/OpenNMT-py/issues/1158
Thanks for the contribution, however you will need to change the logic. For a first PR you just need to add BLEU as an extra validation metric (look at how accuracy and ppl are done in the validation process). You don't need to change the loss-related functions. Maybe later on we can change the loss function itself to implement other things, but that is not the point at this stage.
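The pattern being asked for might look like this: accuracy and perplexity are derived from counts accumulated per batch, and BLEU would be reported the same way. This is only an illustrative sketch, not OpenNMT-py's actual `Statistics` class; the field and method names are assumptions.

```python
import math

class ValidStats:
    """Illustrative validation-statistics holder: per-batch counts are
    summed, and scores (acc / ppl / bleu) are derived only at report time."""

    def __init__(self):
        self.loss = 0.0
        self.n_words = 0
        self.n_correct = 0
        self.bleu_matches = 0   # clipped unigram matches, for brevity
        self.bleu_totals = 0    # hypothesis unigram count

    def update(self, loss, n_words, n_correct, bleu_matches, bleu_totals):
        self.loss += loss
        self.n_words += n_words
        self.n_correct += n_correct
        self.bleu_matches += bleu_matches
        self.bleu_totals += bleu_totals

    def accuracy(self):
        return 100.0 * self.n_correct / self.n_words

    def ppl(self):
        # Clamp the exponent to avoid overflow early in training.
        return math.exp(min(self.loss / self.n_words, 100))

    def bleu(self):
        # Unigram precision only; the full metric uses 1..4-grams
        # plus a brevity penalty.
        return 100.0 * self.bleu_matches / max(self.bleu_totals, 1)
```

The key design point is that BLEU sits at the same level as ppl and accuracy: the loss computation itself is untouched.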
@vince62s so do you want all the members of the BLEU class to be included in the Statistics class, with updates to precision_matches etc. as part of the batch_stats update?
no, the class in its own file is fine. I am just saying that BLEU is another metric at the same level as PPL or ACC, that's it.
@vince62s please see how it looks now. Thanks !
@vince62s to remove the batch size a small refactor is required. Please have a look; I also moved it to bleu.py.
Hi. It's really nice to have this feature. I tried it and have a silly question about it. Using this feature, I saw very small BLEU scores on the dev set during training. However, when I used the same models to translate the same dev set and calculated the BLEU score of the results, the scores were much higher (and more reasonable), even with beam size = 1. I am trying to understand why the two cases differ, but I don't have an answer for now. I wonder if I'm wrong somewhere, or do you see the same problem? Do you have any idea about this?
Hi again. I tried to investigate this problem and found that the BLEU score varies greatly when valid_batch_size changes significantly. In detail, when valid_batch_size increases, the number of predicted tokens seems to increase, and consequently precision_totals also increases (for every n-gram order), whereas precision_matches does not seem to depend on valid_batch_size. The count clipping seems to work fine here. To summarize: as valid_batch_size increases, precision_totals for every n-gram order increases significantly while precision_matches stays the same, which obviously decreases the BLEU score. I prefer to set valid_batch_size = 1 in this case.
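One plausible mechanism behind this (a toy reproduction, not a confirmed diagnosis of the PR's code): if hypotheses in a batch are padded to the batch's longest sequence and those pad tokens are not stripped before counting n-grams, then precision_totals grows with valid_batch_size while the clipped matches stay fixed, deflating the score exactly as described. All names here are hypothetical.

```python
from collections import Counter

PAD = "<pad>"

def pad_to(tokens, length):
    # Simulate batching: shorter sequences are right-padded to the
    # longest sequence in the batch.
    return tokens + [PAD] * (length - len(tokens))

def unigram_stats(hyp, ref):
    # Clipped unigram matches and total hypothesis unigrams.
    hyp_c, ref_c = Counter(hyp), Counter(ref)
    matches = sum(min(c, ref_c[t]) for t, c in hyp_c.items())
    return matches, sum(hyp_c.values())

hyp = "the cat sat".split()
ref = "the cat sat".split()

# valid_batch_size = 1: no padding needed.
m1, t1 = unigram_stats(hyp, ref)

# Larger batch: a longer sentence in the same batch forces padding
# to length 8; pad tokens are (wrongly) left in the hypothesis.
m2, t2 = unigram_stats(pad_to(hyp, 8), ref)
```

Here `m1 == m2` but `t2 > t1`, so precision drops as batches get larger, which matches the observation above and would explain why valid_batch_size = 1 gives more reasonable numbers.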
Any update on this one? It would be great to have this. Considering that it is implemented in the TF version, some inspiration can be found there.
Preferably the BLEU implementation that is used should be identical between the TF and PT version of onmt (!)
You can display the BLEU scores in TensorBoard by adding another scalar at the end of statistics.py!