DialoGPT
Extract human response from 6k multi-ref dataset
Hi,
I'm trying to reproduce the human response result in the paper and have run into a problem.
I copied test.scored_refs.txt to the dstc/data folder and used the first column as the keys.
The eval result after running python extract_human.py and python batch_eval.py is
n_lines = 5994
NIST = [2.871, 3.246, 3.3125, 3.3229]
BLEU = [0.378, 0.1678, 0.0966, 0.0655]
METEOR = 0.10657856237003654
entropy = [6.61382916462754, 10.109370475853682, 11.032526832134234, 11.125019724262556]
diversity = [0.12143963906484984, 0.5817823864609064]
avg_len = 14.64330997664331
which is different from the paper; even the avg_len is wrong.
I'm wondering which step is wrong and how to reproduce the result.
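In case the problem is in my key extraction, this is roughly what I do with the first column (a minimal sketch, not the repo's own script; I'm assuming the file is tab-separated, and the output name keys.txt is my own):

# Sketch: take the first tab-separated column of test.scored_refs.txt as keys.
# Paths and the output filename are my own assumptions.
with open("dstc/data/test.scored_refs.txt", encoding="utf-8") as f_in, \
     open("dstc/data/keys.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        f_out.write(line.rstrip("\n").split("\t")[0] + "\n")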
Thanks!
Hi, the human reference file has been uploaded. Please find it here: data/human.ref.6k.txt. You might want to use this human reference file to compute against the other references. Also, your total line count is not 6000; I'm not sure why, but it may be worth examining.
Thanks for the update.
The file and the avg_len seem correct, but the eval result is still wrong.
I observed two problems and fixed them with the following script:
cat ../data/test.refs.txt | cut -f 2- | rev | cut -f 2- | rev > ./data/test.refs.tmp.txt
seq 6000 > ./data/keys.6k.txt
paste ./data/keys.6k.txt ./data/test.refs.tmp.txt > ./data/test.refs.txt
- The human response is obtained from the last column of test.refs.txt, so I exclude the last column from the references.
- The first column has some duplicated keys, so I replace them with distinct numbers.
and run
$ python3 dstc.py human.6k.resp.txt --ref ./data/test.refs.txt --keys ./data/keys.6k.txt --vshuman -1
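For reference, the same fix expressed in Python (a minimal sketch; it assumes the same tab-separated layout and relative paths as the shell commands above):

# Sketch of what the shell pipeline above does (paths and layout assumed):
#  1) drop the first column (duplicated keys) and the last column (human response),
#  2) write fresh, distinct keys and paste them back as the first column.
with open("../data/test.refs.txt", encoding="utf-8") as f_in, \
     open("./data/test.refs.txt", "w", encoding="utf-8") as f_refs, \
     open("./data/keys.6k.txt", "w", encoding="utf-8") as f_keys:
    for i, line in enumerate(f_in, start=1):
        cols = line.rstrip("\n").split("\t")
        f_refs.write("\t".join([str(i)] + cols[1:-1]) + "\n")
        f_keys.write(str(i) + "\n")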
The result of this run looks almost correct, except for NIST4, which is 4.25 in the paper:
n_lines = 6000
NIST = [2.9939, 3.412, 3.491, 3.5033]
BLEU = [0.3961, 0.179, 0.1071, 0.0748]
METEOR = 0.10636074642754038
entropy = [6.864962939185212, 10.213254208172751, 10.970525196688564, 10.99510001622831]
diversity = [0.1454816096487322, 0.6296332006446193]
avg_len = 13.100166666666667
NIST2 and NIST4 are very close in all other experiments, so 3.5 seems more reasonable. Maybe the NIST4 score is a typo in the paper?
@andy
I obtained the same results as you, so it's possible an error was made in the paper.