Unable to reproduce results from paper (Table 3)
Hi,
I am trying to reproduce the results reported in Table 3 of the paper Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset (https://www.aclweb.org/anthology/D19-1459.pdf).
The paper reports a BLEU score of 6.11, but I am getting 4.34 from moses-multi-bleu and 5.99 (BLEU_uncased) / 5.55 (BLEU_cased) from t2t-bleu.
I am using the T2T code to define a new Problem for Taskmaster and train a Transformer model (https://github.com/tensorflow/tensor2tensor/blob/master/docs/new_problem.md). Here are the steps I followed -
- Data creation - For training the Transformer model, I create <inputs, targets> pairs as follows (a sketch of the pair-generation code is included after Question 2 below). For a dialog, say
[U1, S1, U2, S2, U3, S3, U4]
where U(j) is the j-th USER utterance and S(k) is the k-th ASSISTANT utterance, I generate the following pairs -
Input - U1 __eou__ __eot__
Target - S1

Input - U1 __eou__ __eot__ S1 __eou__ __eot__ U2 __eou__ __eot__
Target - S2

Input - U1 __eou__ __eot__ S1 __eou__ __eot__ U2 __eou__ __eot__ S2 __eou__ __eot__ U3 __eou__ __eot__
Target - S3
Question 1 - Did the authors also add special tokens __eou__ and __eot__ to denote end of utterance and end of turn? If not, how were the <inputs, targets> pairs created for training the model?
Question 2 - Did the <inputs, targets> pairs also include USER utterances as targets for training the model?
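This is a minimal sketch of how I build the pairs; the __eou__/__eot__ tokens and the build_pairs helper are my own choices, not taken from the paper -

# Minimal sketch of my pair generation; the __eou__/__eot__ tokens and this
# helper are my own choices, not taken from the paper.
EOU, EOT = "__eou__", "__eot__"

def build_pairs(utterances, speakers):
    # utterances: list of strings; speakers: parallel list of "USER"/"ASSISTANT".
    pairs, context = [], []
    for utt, spk in zip(utterances, speakers):
        if spk == "ASSISTANT" and context:
            # Dialog context so far (ending in a USER turn) -> assistant reply.
            pairs.append((" ".join(context), utt))
        context.extend([utt, EOU, EOT])
    return pairs

dialog = ["U1", "S1", "U2", "S2", "U3", "S3", "U4"]
roles = ["USER", "ASSISTANT"] * 3 + ["USER"]
for inp, tgt in build_pairs(dialog, roles):
    print("Input -", inp, "| Target -", tgt)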
- I’m using the following hyper-parameters as mentioned in the paper (a sketch of the full hparams registration is included after Question 3 below) -
hparams.num_hidden_layers = 2
hparams.hidden_size = 256
hparams.num_heads = 4
# 0.2 set based on details in paper
hparams.attention_dropout = 0.2
hparams.layer_prepostprocess_dropout = 0.2
# Learning rate not mentioned in the paper; the only optimizer details given are ADAM (β1 = 0.85, β2 = 0.997), so I keep the T2T transformer default (https://github.com/tensorflow/tensor2tensor/blob/ab9fb79b834a69433fe4d82f98ecd73d9ed9f853/tensor2tensor/models/transformer.py#L1761)
hparams.learning_rate = 0.1
# Filter size not mentioned in the paper
hparams.filter_size = 512
Question 3 - Can you please confirm if the hyper-parameters above are correct?
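For completeness, this is roughly how I register the hparams set in my t2t_usr_dir; the name transformer_taskmaster is mine, and starting from transformer_base is an assumption on my part -

# Sketch of my hparams registration; values mirror the list above, and
# starting from transformer_base is my own assumption, not from the paper.
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry

@registry.register_hparams
def transformer_taskmaster():
    hparams = transformer.transformer_base()
    hparams.num_hidden_layers = 2
    hparams.hidden_size = 256
    hparams.num_heads = 4
    hparams.attention_dropout = 0.2
    hparams.layer_prepostprocess_dropout = 0.2
    hparams.learning_rate = 0.1
    hparams.filter_size = 512
    hparams.optimizer_adam_beta1 = 0.85
    hparams.optimizer_adam_beta2 = 0.997
    return hparams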
- Vocab size is not mentioned in the paper. I used 2 ** 12 (4096) when creating the dataset (a fuller Problem sketch is included after Question 4 below) -
def approx_vocab_size(self):
    return 2 ** 12  # 4096, or 2 ** 13 (~8k)
Question 4 - Did the authors use 2 ** 13 (~8k) as the vocab size?
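For context, my Problem definition looks roughly like the following; the class name Taskmaster, the Text2TextProblem base class, and the load_taskmaster_pairs helper are my own (hypothetical) choices, not from the paper -

# Sketch of my Problem registration in t2t_usr_dir; the class name, base
# class, and load_taskmaster_pairs helper are my own hypothetical choices.
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

@registry.register_problem
class Taskmaster(text_problems.Text2TextProblem):

    @property
    def approx_vocab_size(self):
        return 2 ** 12  # 4096; should this be 2 ** 13 (~8k)?

    @property
    def is_generate_per_split(self):
        return True  # I generate train/dev/test splits myself

    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        # Yields the <inputs, targets> pairs built as described above.
        for inputs, targets in load_taskmaster_pairs(dataset_split):  # hypothetical helper
            yield {"inputs": inputs, "targets": targets}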
- I train the model for 500000 steps on a single GPU (Tesla V100 PCIe, 32 GB) using -
t2t-trainer \
--data_dir=$DATA_DIR \
--t2t_usr_dir=$T2T_USR_DIR/taskmaster/trainer \
--problem=taskmaster \
--model=transformer \
--hparams_set=transformer_taskmaster \
--output_dir=$OUTDIR \
--train_steps=500000 \
--worker_gpu=1
and these are the metrics at step 500000 -
INFO:tensorflow:Saving dict for global step 500000: global_step = 500000,
loss = 3.6279776,
metrics-taskmaster/targets/accuracy = 0.39591935,
metrics-taskmaster/targets/accuracy_per_sequence = 0.0073260074,
metrics-taskmaster/targets/accuracy_top5 = 0.6327749,
metrics-taskmaster/targets/approx_bleu_score = 0.04866978,
metrics-taskmaster/targets/neg_log_perplexity = -3.5477512,
metrics-taskmaster/targets/rouge_2_fscore = 0.13960487,
metrics-taskmaster/targets/rouge_L_fscore = 0.17148268
- Decoding on the test set. The test set <inputs, targets> pairs were created in the same way as described above, i.e. the model is only asked to generate the assistant’s utterances given the dialog context so far.
BEAM_SIZE=4
ALPHA=0.6
t2t-decoder \
--data_dir=$DATA_DIR \
--problem=$PROBLEM \
--model=$MODEL \
--hparams_set=$HPARAMS \
--output_dir=$OUTDIR \
--t2t_usr_dir=$T2T_USR_DIR/taskmaster/trainer \
--decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
--decode_from_file=$DECODE_FILE
and then I use moses-multi-bleu (https://github.com/google/seq2seq/blob/master/seq2seq/metrics/bleu.py) to compute the BLEU score = 4.34.
Question 5 - Is the above training and eval configuration correct to reproduce the results?
- However, if I use the t2t-bleu script to get the BLEU score, I get BLEU_uncased = 5.99 and BLEU_cased = 5.55.
Question 6 - Why is there a difference between the BLEU scores from these two scripts, and which one is correct?
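For reference, this is roughly how I compute the two scores on the same files; decodes.txt and targets.txt are placeholders for my decoder output and reference files, and I'm assuming the seq2seq package from the repo linked above is importable. My understanding is that t2t-bleu applies its own tokenization while multi-bleu.perl scores the whitespace-tokenized text as-is, which may explain part of the gap -

# Sketch of how I compare the two BLEU implementations on the same files.
# decodes.txt / targets.txt are placeholders (one sentence per line).
from tensor2tensor.utils import bleu_hook

uncased = 100 * bleu_hook.bleu_wrapper("targets.txt", "decodes.txt", case_sensitive=False)
cased = 100 * bleu_hook.bleu_wrapper("targets.txt", "decodes.txt", case_sensitive=True)
print("t2t-bleu: uncased=%.2f cased=%.2f" % (uncased, cased))

# moses multi-bleu from the seq2seq repo; it shells out to multi-bleu.perl,
# which just splits the given text on whitespace.
from seq2seq.metrics.bleu import moses_multi_bleu
with open("decodes.txt") as f:
    hyps = [line.strip() for line in f]
with open("targets.txt") as f:
    refs = [line.strip() for line in f]
print("moses multi-bleu:", moses_multi_bleu(hyps, refs, lowercase=False))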
Please let me know if you need additional information. Thanks