KoLA
Ambiguity on the evaluation metrics
Which metric do you actually report for these datasets: F1 or EM, ROUGE or BLEU? I could not determine this from reading the paper.
Also, BLEU has many variants. Which variant does your implementation use?
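
To illustrate why the choice matters, here is a minimal sketch of how EM and token-level F1 can diverge on the same prediction. This is my own illustration, not code from the paper or the KoLA repository; the function names `exact_match` and `token_f1` are hypothetical, and the normalization (lowercasing and whitespace tokenization) is an assumption that may differ from what you actually use.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0.

    Assumed normalization; the real evaluation may normalize differently.
    """
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 over whitespace-separated, lowercased tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pred, ref = "Barack Obama", "Obama"
    print("EM:", exact_match(pred, ref))
    print("F1:", token_f1(pred, ref))
```

With a prediction of "Barack Obama" against a reference of "Obama", EM gives no credit while token F1 gives partial credit, so scores reported under the two metrics are not comparable. The same concern applies to BLEU, where the n-gram order, smoothing, and tokenization can all change the number.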