Paragraph-level-Simplification-of-Medical-Texts
How are ROUGE/BLEU/SARI calculated?
Hi @AshOlogn,
I'd like to replicate the evaluation of the pre-trained models. How exactly were ROUGE/BLEU/SARI (Table 6 in the paper) computed? Could you provide your evaluation script? I made an attempt with a custom evaluation script and got results that are quite different from the paper (see below).
Thanks!
Attachment
This is what I tried:
- Create the environment (see below)
- Download the pre-trained model bart-no-ul as per the README
- Run generation for the test set: sh scripts/generate/bart_gen_no-ul.sh (with --generate_end_index=None)
- Evaluate with a custom script (see below)
R-1 = 46.94 // paper: 40.0
R-2 = 19.22 // paper: 15.0
R-L = 43.77 // paper: 37.0
BLEU = 15.73 // paper: 44.0
SARI = 35.44 // paper: 38.0
environment
conda create -n parasimp \
python=3.7 \
pytorch=1.7.1 \
cudatoolkit=11.0 \
-c pytorch -c defaults
conda activate parasimp
pip install pytorch-lightning==0.9.0 transformers==3.3.1 rouge_score nltk gdown
pip install -U "protobuf<=3.21"
git clone https://github.com/feralvam/easse.git
cd easse
pip install -e .
evaluate.py
import json

from easse.sari import corpus_sari
from easse.bleu import corpus_bleu

# calculate_rouge is the authors' implementation in modeling/utils.py
from utils import calculate_rouge

# Load model generations.
with open('trained_models/bart-no-ul/gen_nucleus_test_1_0-none.json') as fin:
    sys_sents = json.load(fin)
    sys_sents = [x['gen'] for x in sys_sents]

# Load source abstracts and reference simplifications.
with open('data/data-1024/test.source') as fin:
    orig_sents = [l.strip() for l in fin.readlines()]

with open('data/data-1024/test.target') as fin:
    refs_sents = [l.strip() for l in fin.readlines()]

# ROUGE via the authors' calculate_rouge.
scores = calculate_rouge(sys_sents, refs_sents)
print('R-1 = {:.2f}'.format(scores['rouge1']))
print('R-2 = {:.2f}'.format(scores['rouge2']))
print('R-L = {:.2f}'.format(scores['rougeLsum']))

# BLEU and SARI via EASSE (a single reference per system output).
bleu = corpus_bleu(
    sys_sents=sys_sents,
    refs_sents=[refs_sents],
    lowercase=False,
)
print(f'BLEU = {bleu:.2f}')

sari = corpus_sari(
    orig_sents=orig_sents,
    sys_sents=sys_sents,
    refs_sents=[refs_sents],
)
print(f'SARI = {sari:.2f}')
Hi @jantrienes,
I also raised an issue about the evaluation metrics before, and I think we can discuss it here. For the bart-no-ul reimplementation, I got the following results:
R-1 = 46.78 // yours: 46.94, paper: 40.0
R-2 = 19.24 // yours: 19.22, paper: 15.0
R-L = 25.97 // yours: 43.77, paper: 37.0
BLEU = 11.52 // yours: 15.73, paper: 44.0
SARI = 38.72 // yours: 35.44, paper: 38.0
For the first three ROUGE scores, I used the rouge_score package. You can see our R-1 and R-2 are almost the same, but I don't understand why my R-L value is so low.
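Roughly, what I did looks like this (a simplified sketch, not my exact script; it scores whole paragraphs with 'rougeL' and averages F1, reusing sys_sents/refs_sents from your script above, and use_stemmer=True is just an assumption):
from rouge_score import rouge_scorer

# Sketch: score whole paragraphs with sentence-level 'rougeL' (no newline splitting),
# then average the F1 scores over the test set.
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = [scorer.score(ref, gen) for ref, gen in zip(refs_sents, sys_sents)]
rouge_l = 100 * sum(s['rougeL'].fmeasure for s in scores) / len(scores)
print(f'R-L = {rouge_l:.2f}')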
For the last two scores, BLEU and SARI, I used the evaluate package (a Hugging Face library). My SARI score is very close to the one in the paper, so I guess they may have used the same package. By the way, I also tried to calculate all ROUGE scores via the evaluate package:
R-1 = 44.61 // yours: 46.94, paper: 40.0
R-2 = 18.31 // yours: 19.22, paper: 15.0
R-L = 25.07 // yours: 43.77, paper: 37.0
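The BLEU/SARI calls were along these lines (a minimal sketch from memory, reusing orig_sents/sys_sents/refs_sents from your script, with one reference per prediction):
import evaluate

# Sketch: BLEU and SARI via the Hugging Face evaluate library.
bleu_metric = evaluate.load('bleu')
sari_metric = evaluate.load('sari')

references = [[ref] for ref in refs_sents]  # evaluate expects a list of references per prediction

bleu_result = bleu_metric.compute(predictions=sys_sents, references=references)
print('BLEU = {:.2f}'.format(100 * bleu_result['bleu']))  # 'bleu' is reported on a 0-1 scale

sari_result = sari_metric.compute(sources=orig_sents, predictions=sys_sents, references=references)
print('SARI = {:.2f}'.format(sari_result['sari']))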
Hi @LuJunru, thanks for your reply. ROUGE, SARI, and BLEU have several parameters, and the results depend on the preprocessing, so I think we have to do a bit of guesswork here.
As for ROUGE, I used the authors' implementation in modeling/utils.py. For the R-L metric, sentences need to be separated by newlines, which are added there. Did you use similar pre-processing before calling rouge_score?
https://github.com/AshOlogn/Paragraph-level-Simplification-of-Medical-Texts/blob/edf6504ea28b2458ec6b4c172482ad15387aeeef/modeling/utils.py#L497-L501
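In essence, the pre-processing amounts to something like this (a rough sketch, not the exact code from utils.py; `reference` and `generation` stand for a single reference/system paragraph):
import nltk  # sentence splitting requires the punkt tokenizer: nltk.download('punkt')
from rouge_score import rouge_scorer

def add_newlines(text):
    # rougeLsum expects one sentence per line
    return '\n'.join(nltk.sent_tokenize(text))

scorer = rouge_scorer.RougeScorer(['rougeLsum'], use_stemmer=True)
score = scorer.score(add_newlines(reference), add_newlines(generation))
print('R-L = {:.2f}'.format(100 * score['rougeLsum'].fmeasure))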
SARI scores look quite close indeed. For the BLEU scores, I do not have any new insights.
Hi @jantrienes,
thanks for the explanation about R-L. I actually found that there are both RougeL and RougeLsum notations in the rouge_score package, and you can see the authors implement their ROUGE calculation based on exactly this rouge_score package. However, the original package does not support RougeLsum.
As for the BLEU scores, I tested the NLTK package and found that the value depends on the BLEU 1-4 weights. The BLEU result becomes 31.27 if I set the following weights:
BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference], hypothesis, weights=(1, 0, 0, 0))
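At the corpus level, the comparison looks roughly like this (a sketch with simple whitespace tokenization, reusing refs_sents/sys_sents from the script above):
from nltk.translate.bleu_score import corpus_bleu as nltk_corpus_bleu

# Sketch: corpus-level BLEU with default 4-gram weights vs. unigram-only weights.
refs = [[ref.split()] for ref in refs_sents]  # one list of tokenized references per hypothesis
hyps = [gen.split() for gen in sys_sents]

bleu4 = nltk_corpus_bleu(refs, hyps)  # default weights (0.25, 0.25, 0.25, 0.25)
bleu1 = nltk_corpus_bleu(refs, hyps, weights=(1, 0, 0, 0))
print(f'BLEU-4 = {100 * bleu4:.2f}, BLEU-1 = {100 * bleu1:.2f}')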