Paragraph-level-Simplification-of-Medical-Texts

How are ROUGE/BLEU/SARI calculated?

jantrienes opened this issue 2 years ago • 3 comments

Hi @AshOlogn,

I'd like to replicate the evaluation of the pre-trained models. How exactly were ROUGE/BLEU/SARI (Table 6 in the paper) computed? Could you provide your evaluation script? I made an attempt with a custom evaluation script and got results that are quite different from those in the paper (see below).

Thanks!

Attachment

This is what I tried.

  1. Create environment (see below)
  2. Download pre-trained model bart-no-ul as per README
  3. Run generation for the test set: sh scripts/generate/bart_gen_no-ul.sh (with --generate_end_index=None)
  4. Evaluate with a custom script (see below)
R-1 = 46.94      // paper: 40.0
R-2 = 19.22      // paper: 15.0
R-L = 43.77      // paper: 37.0
BLEU = 15.73     // paper: 44.0
SARI = 35.44     // paper: 38.0 

environment

conda create -n parasimp \
  python=3.7 \
  pytorch=1.7.1 \
  cudatoolkit=11.0 \
  -c pytorch -c defaults

conda activate parasimp
pip install pytorch-lightning==0.9.0 transformers==3.3.1 rouge_score nltk gdown
pip install -U "protobuf<=3.21" 

git clone https://github.com/feralvam/easse.git
cd easse
pip install -e .

evaluate.py

import json

from easse.sari import corpus_sari
from easse.bleu import corpus_bleu
from utils import calculate_rouge

# Model outputs generated in step 3 (one summary per test example).
with open('trained_models/bart-no-ul/gen_nucleus_test_1_0-none.json') as fin:
    sys_sents = json.load(fin)
    sys_sents = [x['gen'] for x in sys_sents]

# Source paragraphs and reference summaries of the test split.
with open('data/data-1024/test.source') as fin:
    orig_sents = [l.strip() for l in fin.readlines()]
with open('data/data-1024/test.target') as fin:
    refs_sents = [l.strip() for l in fin.readlines()]

# ROUGE via the authors' calculate_rouge helper (modeling/utils.py).
scores = calculate_rouge(sys_sents, refs_sents)
print('R-1 = {:.2f}'.format(scores['rouge1']))
print('R-2 = {:.2f}'.format(scores['rouge2']))
print('R-L = {:.2f}'.format(scores['rougeLsum']))

# BLEU via EASSE; there is one reference per example, so a single reference set.
bleu = corpus_bleu(
    sys_sents=sys_sents,
    refs_sents=[refs_sents],
    lowercase=False
)
print(f'BLEU = {bleu:.2f}')

# SARI via EASSE; additionally needs the original (source) paragraphs.
sari = corpus_sari(
    orig_sents=orig_sents,
    sys_sents=sys_sents,
    refs_sents=[refs_sents]
)
print(f'SARI = {sari:.2f}')

jantrienes · Jul 20 '22 10:07

Hi @jantrienes,

I also raised an issue about evaluation metrics before, and I think we can continue the discussion here. For the bart-no-ul reimplementation, I got the following results:

R-1 = 46.78      // yours: 46.94, paper: 40.0
R-2 = 19.24      // yours: 19.22, paper: 15.0
R-L = 25.97      // yours: 43.77, paper: 37.0
BLEU = 11.52     // yours: 15.73, paper: 44.0
SARI = 38.72     // yours: 35.44, paper: 38.0

For the first three ROUGE scores, I used the rouge_score package. Our R-1 and R-2 are almost the same, but I don't understand why my R-L value is so low.

For the last two scores, BLEU and SARI, I used the evaluate package (a Hugging Face library). My SARI score is very close to the one in the paper, so I guess they may have used the same package. By the way, I also tried to calculate all ROUGE scores via the evaluate package:

R-1 = 44.61      // yours: 46.94, paper: 40.0
R-2 = 18.31      // yours: 19.22, paper: 15.0
R-L = 25.07      // yours: 43.77, paper: 37.0
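
For reference, the evaluate-based calls look roughly like this (a minimal sketch with the metrics' default settings, not necessarily the exact code I ran; the sacrebleu metric is one possible choice for BLEU, and the toy inputs stand in for the sys_sents/orig_sents/refs_sents lists from the script above):

import evaluate

# Toy inputs standing in for the real test-set lists.
sys_sents = ["A generated plain-language summary."]
orig_sents = ["The original technical paragraph."]
refs_sents = ["The reference plain-language summary."]

rouge = evaluate.load("rouge")
print(rouge.compute(predictions=sys_sents, references=refs_sents))

# sacrebleu expects one list of references per prediction.
bleu = evaluate.load("sacrebleu")
print(bleu.compute(predictions=sys_sents, references=[[ref] for ref in refs_sents]))

# SARI additionally needs the source texts.
sari = evaluate.load("sari")
print(sari.compute(sources=orig_sents, predictions=sys_sents,
                   references=[[ref] for ref in refs_sents]))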

LuJunru · Aug 11 '22 16:08

Hi @LuJunru, thanks for your reply. ROUGE, SARI and BLEU each have several parameters, and the results depend on the preprocessing, so I think we have to do a bit of guesswork here.

As for ROUGE, I used the authors' implementation in modeling/utils.py. For the R-L metric, sentences need to be separated by newlines, which that helper adds. Did you apply a similar pre-processing step before calling rouge_score?

https://github.com/AshOlogn/Paragraph-level-Simplification-of-Medical-Texts/blob/edf6504ea28b2458ec6b4c172482ad15387aeeef/modeling/utils.py#L497-L501
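
For illustration, the pre-processing looks roughly like this (a minimal sketch assuming NLTK sentence splitting; the actual helper in modeling/utils.py may differ in details):

import nltk
from rouge_score import rouge_scorer

nltk.download("punkt", quiet=True)

def add_newlines(text):
    # rougeLsum splits texts on "\n", so put each sentence on its own line.
    return "\n".join(nltk.sent_tokenize(text))

reference = "This is a plain-language summary. It has two sentences."
prediction = "This is a generated summary. It also has two sentences."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
print(scorer.score(add_newlines(reference), add_newlines(prediction)))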

SARI scores look quite close indeed. For the BLEU scores, I do not have any new insights.

jantrienes · Aug 11 '22 17:08

Hi @jantrienes,

thanks for the explanation about R-L. I found that there are both RougeL and RougeLsum variants in the rouge_score package, and the authors base their ROUGE calculation on exactly this package. However, the original package does not support RougeLsum.

As for the BLEU scores, I tested the NLTK package and found that the value depends on the weights for the 1- to 4-gram precisions. The BLEU result becomes 31.27 if I set the following weights:

BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis, weights=(1, 0, 0, 0))
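
For comparison, here is a minimal sketch contrasting the default uniform 1-to-4-gram weights with the unigram-only setting above (toy sentences, just to show the effect of the weights):

from nltk.translate.bleu_score import sentence_bleu

reference = "the treatment reduced pain in most patients".split()
hypothesis = "the treatment reduced pain for many patients".split()

# Default BLEU: geometric mean of 1- to 4-gram precisions.
print(sentence_bleu([reference], hypothesis, weights=(0.25, 0.25, 0.25, 0.25)))

# Unigram-only weights, as in the call above; this usually gives a much higher score.
print(sentence_bleu([reference], hypothesis, weights=(1, 0, 0, 0)))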

LuJunru · Aug 11 '22 17:08