
Extract human response from 6k multi-ref dataset

Open andy920262 opened this issue 4 years ago • 3 comments

Hi,

I'm trying to reproduce the human response results from the paper and have run into some problems. I copied test.scored_refs.txt into the dstc/data folder and used the first column as the keys. The eval result after running python extract_human.py and python batch_eval.py is

n_lines = 5994
NIST = [2.871, 3.246, 3.3125, 3.3229]
BLEU = [0.378, 0.1678, 0.0966, 0.0655]
METEOR = 0.10657856237003654
entropy = [6.61382916462754, 10.109370475853682, 11.032526832134234, 11.125019724262556]
diversity = [0.12143963906484984, 0.5817823864609064]
avg_len = 14.64330997664331

which differs from the paper; even the avg_len is wrong. I'm wondering which step went wrong and how to reproduce the result.
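
For reference, the exact commands I ran were roughly as follows (the working directory and paths are my assumption of the intended layout):

# copy the scored references into the dstc data folder, then run the extraction and evaluation scripts
cp test.scored_refs.txt dstc/data/
cd dstc
python extract_human.py
python batch_eval.py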

Thanks!

andy920262 · Sep 02 '20 07:09

Hi, the human reference file has been uploaded. Please find it here: data/human.ref.6k.txt. You might want to use this human reference file to compute against the other references. Also, your total line count is not 6000. I am not sure why, but it may be worth examining.
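
A quick sanity check on the line count (assuming the path above) is something like:

wc -l data/human.ref.6k.txt   # should report 6000 lines, one per test context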

dreasysnail · Sep 03 '20 18:09

Thanks for the update.

The file and the avg_len seem correct, but the eval results are still wrong.

I observed two problems and fixed them with the following script:

# drop the first column (duplicated keys) and the last column (human response) from the references
cat ../data/test.refs.txt | cut -f 2- | rev | cut -f 2- | rev > ./data/test.refs.tmp.txt
# generate distinct numeric keys to replace the duplicated ones
seq 6000 > ./data/keys.6k.txt
# prepend the new keys to the trimmed references
paste ./data/keys.6k.txt ./data/test.refs.tmp.txt > ./data/test.refs.txt
  1. The human response is obtained from the last column of test.refs.txt, so I exclude the last column from the references.
  2. The first column has some duplicated keys, so I replace them with distinct numbers (see the check below).
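
A quick way to confirm the duplicated keys in the original file (same paths as above):

cut -f 1 ../data/test.refs.txt | sort | uniq -d | wc -l   # a non-zero count means duplicated keys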

Then I run:

$ python3 dstc.py human.6k.resp.txt --ref ./data/test.refs.txt --keys ./data/keys.6k.txt --vshuman -1

The result looks almost correct, except for NIST4, which is 4.25 in the paper:

n_lines = 6000
NIST = [2.9939, 3.412, 3.491, 3.5033]
BLEU = [0.3961, 0.179, 0.1071, 0.0748]
METEOR = 0.10636074642754038
entropy = [6.864962939185212, 10.213254208172751, 10.970525196688564, 10.99510001622831]
diversity = [0.1454816096487322, 0.6296332006446193]
avg_len = 13.100166666666667

NIST2 and NIST4 are very close in all other experiments, so 3.5 seems more reasonable. Maybe the NIST4 score was typed wrong in the paper?

andy920262 · Sep 04 '20 07:09

@andy920262


I obtained the same results as you, so it's possible an error was made in the paper.

theyorubayesian · Jul 06 '21 08:07