
Unable to reproduce results from paper (Table 3)


Hi,

I am trying to reproduce the results reported in Table 3 of the paper Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset (https://www.aclweb.org/anthology/D19-1459.pdf).

The paper reports a BLEU score of 6.11, but I am getting 4.34 from moses-multi-bleu, and 5.99 (BLEU_uncased) / 5.55 (BLEU_cased) from t2t-bleu.


I am using the T2T code to create a new Problem for Taskmaster and train a Transformer model (https://github.com/tensorflow/tensor2tensor/blob/master/docs/new_problem.md). Here are the steps I followed:



  1. Data creation - For training the Transformer model, I create <inputs, targets> pairs as follows. For a dialog, say [U1, S1, U2, S2, U3, S3, U4], where U(j) is the j-th USER utterance and S(k) is the k-th ASSISTANT utterance, I generate the following pairs:

Input - U1 __eou__ __eot__
Target - S1

Input - U1 __eou__ __eot__ S1 __eou__ __eot__ U2 __eou__ __eot__
Target - S2

Input - U1 __eou__ __eot__ S1 __eou__ __eot__ U2 __eou__ __eot__ S2 __eou__ __eot__ U3 __eou__ __eot__
Target - S3



Question 1 - Did the authors also add special tokens __eou__ and __eot__ to denote end of utterance and end of turn? If not, how were the <inputs, targets> pairs created for training the model?
Question 2 - Did the <inputs, targets> pairs also include USER utterances as targets when training the model?
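
For reference, here is a minimal sketch of the pair-creation logic described in step 1. The `make_pairs` helper and the list-of-(speaker, utterance) input format are my own simplification for this issue, not something taken from the paper or the dataset loader:

```python
EOU = "__eou__"
EOT = "__eot__"

def make_pairs(dialog):
    """dialog: list of (speaker, utterance) tuples in order,
    e.g. [("USER", "U1"), ("ASSISTANT", "S1"), ("USER", "U2"), ...]."""
    pairs = []
    context = []
    for speaker, utterance in dialog:
        if speaker == "ASSISTANT" and context:
            # Only ASSISTANT utterances become targets; the input is the full
            # dialog context so far, with __eou__ __eot__ after every utterance.
            pairs.append((" ".join(context), utterance))
        context.append("{} {} {}".format(utterance, EOU, EOT))
    return pairs
```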



  2. I am using the following hyper-parameters, as mentioned in the paper:


hparams.num_hidden_layers = 2
hparams.hidden_size = 256
hparams.num_heads = 4

# 0.2 set based on details in the paper
hparams.attention_dropout = 0.2
hparams.layer_prepostprocess_dropout = 0.2

# Not mentioned in the paper; taken from the original Transformer setup
# (https://github.com/tensorflow/tensor2tensor/blob/ab9fb79b834a69433fe4d82f98ecd73d9ed9f853/tensor2tensor/models/transformer.py#L1761).
# The only optimizer details given in the paper are ADAM with β1 = 0.85, β2 = 0.997.
hparams.learning_rate = 0.1

# Filter size used is not mentioned in the paper
hparams.filter_size = 512

Question 3 - Can you please confirm if the hyper-parameters above are correct?
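
For context, this is roughly how the transformer_taskmaster hparams set is registered in my --t2t_usr_dir. Starting from transformer_base is my own choice, since the paper does not say which base configuration was used:

```python
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_taskmaster():
  hparams = transformer.transformer_base()
  hparams.num_hidden_layers = 2
  hparams.hidden_size = 256
  hparams.num_heads = 4
  hparams.filter_size = 512          # not mentioned in the paper
  hparams.attention_dropout = 0.2
  hparams.layer_prepostprocess_dropout = 0.2
  hparams.learning_rate = 0.1        # not mentioned in the paper
  # Adam betas are the only optimizer details given in the paper.
  hparams.optimizer_adam_beta1 = 0.85
  hparams.optimizer_adam_beta2 = 0.997
  return hparams
```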



  3. The vocab size used is not mentioned in the paper. I used 2 ** 12 (4096) when creating the dataset.

@property
def approx_vocab_size(self):
    return 2 ** 12  # 4096; alternatively 2 ** 13 (~8k)


Question 4 - Did the authors use 2 ** 13 (~8k) as the vocab size?
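
And this is roughly how the taskmaster Problem is registered in my --t2t_usr_dir, showing where approx_vocab_size plugs in (the class skeleton is my own; generate_samples is omitted for brevity):

```python
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class Taskmaster(text_problems.Text2TextProblem):

  @property
  def approx_vocab_size(self):
    return 2 ** 12  # 4096; unclear whether the paper used 2 ** 13 (~8k)

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    # Yields {"inputs": ..., "targets": ...} dicts built as in step 1.
    raise NotImplementedError("omitted for brevity")
```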



  4. I train the model for 500000 steps on a single GPU (Tesla V100 PCIe 32GB) using:

t2t-trainer \
  --data_dir=$DATA_DIR \
  --t2t_usr_dir=$T2T_USR_DIR/taskmaster/trainer \
  --problem=taskmaster \
  --model=transformer \
  --hparams_set=transformer_taskmaster \
  --output_dir=$OUTDIR \
  --train_steps=500000 \
  --worker_gpu=1


and these are the metrics at step 500000:

INFO:tensorflow:Saving dict for global step 500000: global_step = 500000,
loss = 3.6279776,
metrics-taskmaster/targets/accuracy = 0.39591935,
metrics-taskmaster/targets/accuracy_per_sequence = 0.0073260074,
metrics-taskmaster/targets/accuracy_top5 = 0.6327749,
metrics-taskmaster/targets/approx_bleu_score = 0.04866978,
metrics-taskmaster/targets/neg_log_perplexity = -3.5477512,
metrics-taskmaster/targets/rouge_2_fscore = 0.13960487,
metrics-taskmaster/targets/rouge_L_fscore = 0.17148268



  5. Decoding on the test set. The test set <inputs, targets> pairs were created in the same way as described above, i.e. the model is only asked to generate the assistant's utterances given the dialog context so far; a sketch of how I write the decode and reference files is included below.

BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$OUTDIR \
  --t2t_usr_dir=$T2T_USR_DIR/taskmaster/trainer \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE


and then I use the moses-multi-bleu script (https://github.com/google/seq2seq/blob/master/seq2seq/metrics/bleu.py) to compute the BLEU score = 4.34.

Question 5 - Is the above training and eval configuration correct for reproducing the results?
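
For reference, this is a sketch of how I write $DECODE_FILE and the reference file used for scoring. The file names are placeholders for my actual paths, test_dialogs is assumed to be the loaded Taskmaster-1 test split, and make_pairs is the helper sketched in step 1:

```python
# Build <inputs, targets> pairs for every test dialog, exactly as in training.
test_pairs = []
for dialog in test_dialogs:  # list of (speaker, utterance) tuples per dialog
    test_pairs.extend(make_pairs(dialog))

with open("decode_inputs.txt", "w") as fin, open("test_targets.txt", "w") as fref:
    for inputs, target in test_pairs:
        fin.write(inputs + "\n")   # fed to t2t-decoder via --decode_from_file
        fref.write(target + "\n")  # used as the reference for BLEU scoring
```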


  6. However, if I use the t2t-bleu script to get the BLEU score, I get BLEU_uncased = 5.99 and BLEU_cased = 5.55.

Question 6 - Why is there a difference between the BLEU scores from these two scripts, and which one is correct?
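
For completeness, this is roughly how I call the moses-multi-bleu scorer on the decoder output; the file names are placeholders for my actual decode output and the reference file from step 5:

```python
import numpy as np

# moses_multi_bleu comes from the google/seq2seq repo linked above and wraps
# Moses' multi-bleu.perl script.
from seq2seq.metrics.bleu import moses_multi_bleu

with open("decode_output.txt") as f:
    hypotheses = np.array([line.strip() for line in f])
with open("test_targets.txt") as f:
    references = np.array([line.strip() for line in f])

# lowercase=False keeps the cased score; this is the run that gives 4.34 for me.
print(moses_multi_bleu(hypotheses, references, lowercase=False))
```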

Please let me know if you need additional information. Thanks

jatinganhotra · Feb 06 '20