DialoGPT
Extract human response from 6k multi-ref dataset
Hi,
I'm trying to reproduce the human response result in the paper and have run into a problem.
I copied test.scored_refs.txt to the dstc/data folder and used the first column as the keys.
The eval result after running python extract_human.py and python batch_eval.py is
n_lines = 5994
NIST = [2.871, 3.246, 3.3125, 3.3229]
BLEU = [0.378, 0.1678, 0.0966, 0.0655]
METEOR = 0.10657856237003654
entropy = [6.61382916462754, 10.109370475853682, 11.032526832134234, 11.125019724262556]
diversity = [0.12143963906484984, 0.5817823864609064]
avg_len = 14.64330997664331
which is different from the paper; even the avg_len is wrong.
I'm wondering which step is wrong and how to reproduce the result.
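In case the problem is in my key extraction, this is roughly what I do with the first column (a minimal sketch, not the repo's own script; I'm assuming the file is tab-separated, and the output name keys.txt is my own):

# Sketch: take the first tab-separated column of test.scored_refs.txt as keys.
# Paths and the output filename are my own assumptions.
with open("dstc/data/test.scored_refs.txt", encoding="utf-8") as f_in, \
     open("dstc/data/keys.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        f_out.write(line.rstrip("\n").split("\t")[0] + "\n")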
Thanks!
Hi, the human reference file has been uploaded. Please find it here: data/human.ref.6k.txt. You might want to use this human reference file to compute against the other references. Also, your total line count is not 6000; I'm not sure why, but it may be worth examining.
Thanks for the update.
The file and the avg_len seem correct, but the eval result is still wrong.
I observed two problems and fixed them with the following script:
cat ../data/test.refs.txt | cut -f 2- | rev | cut -f 2- | rev > ./data/test.refs.tmp.txt
seq 6000 > ./data/keys.6k.txt
paste ./data/keys.6k.txt ./data/test.refs.tmp.txt > ./data/test.refs.txt
- The human response is obtained from the last column of test.refs.txt, so I exclude the last column from the references.
- The first column has some duplicated keys, so I replace them with distinct numbers.
and run
$ python3 dstc.py human.6k.resp.txt --ref ./data/test.refs.txt --keys ./data/keys.6k.txt --vshuman -1
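For reference, the same fix expressed in Python (a minimal sketch; it assumes the same tab-separated layout and relative paths as the shell commands above):

# Sketch of what the shell pipeline above does (paths and layout assumed):
#  1) drop the first column (duplicated keys) and the last column (human response),
#  2) write fresh, distinct keys and paste them back as the first column.
with open("../data/test.refs.txt", encoding="utf-8") as f_in, \
     open("./data/test.refs.txt", "w", encoding="utf-8") as f_refs, \
     open("./data/keys.6k.txt", "w", encoding="utf-8") as f_keys:
    for i, line in enumerate(f_in, start=1):
        cols = line.rstrip("\n").split("\t")
        f_refs.write("\t".join([str(i)] + cols[1:-1]) + "\n")
        f_keys.write(str(i) + "\n")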
The result of this run looks almost correct, except for NIST4, which is 4.25 in the paper:
n_lines = 6000
NIST = [2.9939, 3.412, 3.491, 3.5033]
BLEU = [0.3961, 0.179, 0.1071, 0.0748]
METEOR = 0.10636074642754038
entropy = [6.864962939185212, 10.213254208172751, 10.970525196688564, 10.99510001622831]
diversity = [0.1454816096487322, 0.6296332006446193]
avg_len = 13.100166666666667
NIST2 and NIST4 are very close in all other experiments, so 3.5 seems more reasonable. Maybe the NIST4 score is a typo in the paper?
@andy
I obtained the same results as you, so it's possible an error was made in the paper.