
Unable to reproduce results from paper (Table 3)


Hi,

I am trying to reproduce the results reported in Table 3 of the paper Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset (https://www.aclweb.org/anthology/D19-1459.pdf).

The paper reports a BLEU score of 6.11, but I am getting 4.34 from moses-multi-bleu, and 5.99 (BLEU_uncased) / 5.55 (BLEU_cased) from t2t-bleu.


I am using the T2T code to create a new Problem for Taskmaster and train a Transformer model (https://github.com/tensorflow/tensor2tensor/blob/master/docs/new_problem.md). Here are the steps I followed:



  1. Data creation - For training the Transformer model, I create <inputs, targets> pairs as follows. For a dialog, say [U1, S1, U2, S2, U3, S3, U4], where U(j) is the j-th USER utterance and S(k) is the k-th ASSISTANT utterance, I generate the following pairs:

Input - U1 __eou__ __eot__
Target - S1

Input - U1 __eou__ __eot__ S1 __eou__ __eot__ U2 __eou__ __eot__
Target - S2

Input - U1 __eou__ __eot__ S1 __eou__ __eot__ U2 __eou__ __eot__ S2 __eou__ __eot__ U3 __eou__ __eot__
Target - S3



Question 1 - Did the authors also add special tokens __eou__ and __eot__ to denote end of utterance and end of turn? If not, how were the <inputs, targets> pairs created for training the model?
Question 2 - Did the <inputs, targets> pairs also include USER utterances as targets when training the model?
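
For reference, here is a minimal sketch of the pair-creation logic described in step 1. The `make_pairs` helper and the list-of-(speaker, utterance) input format are my own simplification for this issue, not something taken from the paper or the dataset loader:

```python
EOU = "__eou__"
EOT = "__eot__"

def make_pairs(dialog):
    """dialog: list of (speaker, utterance) tuples in order,
    e.g. [("USER", "U1"), ("ASSISTANT", "S1"), ("USER", "U2"), ...]."""
    pairs = []
    context = []
    for speaker, utterance in dialog:
        if speaker == "ASSISTANT" and context:
            # Only ASSISTANT utterances become targets; the input is the full
            # dialog context so far, with __eou__ __eot__ after every utterance.
            pairs.append((" ".join(context), utterance))
        context.append("{} {} {}".format(utterance, EOU, EOT))
    return pairs
```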



  2. I am using the following hyper-parameters, as mentioned in the paper:


hparams.num_hidden_layers = 2
hparams.hidden_size = 256
hparams.num_heads = 4

# 0.2 set based on details in the paper
hparams.attention_dropout = 0.2
hparams.layer_prepostprocess_dropout = 0.2

# Not mentioned in the paper; taken from the original Transformer setup
# (https://github.com/tensorflow/tensor2tensor/blob/ab9fb79b834a69433fe4d82f98ecd73d9ed9f853/tensor2tensor/models/transformer.py#L1761).
# The only optimizer details given in the paper are ADAM with β1 = 0.85, β2 = 0.997.
hparams.learning_rate = 0.1

# Filter size used is not mentioned in the paper
hparams.filter_size = 512

Question 3 - Can you please confirm if the hyper-parameters above are correct?
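
For context, this is roughly how the transformer_taskmaster hparams set is registered in my --t2t_usr_dir. Starting from transformer_base is my own choice, since the paper does not say which base configuration was used:

```python
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_taskmaster():
  hparams = transformer.transformer_base()
  hparams.num_hidden_layers = 2
  hparams.hidden_size = 256
  hparams.num_heads = 4
  hparams.filter_size = 512          # not mentioned in the paper
  hparams.attention_dropout = 0.2
  hparams.layer_prepostprocess_dropout = 0.2
  hparams.learning_rate = 0.1        # not mentioned in the paper
  # Adam betas are the only optimizer details given in the paper.
  hparams.optimizer_adam_beta1 = 0.85
  hparams.optimizer_adam_beta2 = 0.997
  return hparams
```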



  3. The vocab size used is not mentioned in the paper. I used 2 ** 12 (4096) when creating the dataset.

@property
def approx_vocab_size(self):
    return 2 ** 12  # 4096; alternatively 2 ** 13 (~8k)


Question 4 - Did the authors use 2 ** 13 (~8k) as the vocab size?
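
And this is roughly how the taskmaster Problem is registered in my --t2t_usr_dir, showing where approx_vocab_size plugs in (the class skeleton is my own; generate_samples is omitted for brevity):

```python
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class Taskmaster(text_problems.Text2TextProblem):

  @property
  def approx_vocab_size(self):
    return 2 ** 12  # 4096; unclear whether the paper used 2 ** 13 (~8k)

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    # Yields {"inputs": ..., "targets": ...} dicts built as in step 1.
    raise NotImplementedError("omitted for brevity")
```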



  4. I train the model for 500000 steps on a single GPU (Tesla V100 PCIe 32GB) using:

t2t-trainer \
  --data_dir=$DATA_DIR \
  --t2t_usr_dir=$T2T_USR_DIR/taskmaster/trainer \
  --problem=taskmaster \
  --model=transformer \
  --hparams_set=transformer_taskmaster \
  --output_dir=$OUTDIR \
  --train_steps=500000 \
  --worker_gpu=1


and these are the metrics at step 500000:

INFO:tensorflow:Saving dict for global step 500000: global_step = 500000,
loss = 3.6279776,
metrics-taskmaster/targets/accuracy = 0.39591935,
metrics-taskmaster/targets/accuracy_per_sequence = 0.0073260074,
metrics-taskmaster/targets/accuracy_top5 = 0.6327749,
metrics-taskmaster/targets/approx_bleu_score = 0.04866978,
metrics-taskmaster/targets/neg_log_perplexity = -3.5477512,
metrics-taskmaster/targets/rouge_2_fscore = 0.13960487,
metrics-taskmaster/targets/rouge_L_fscore = 0.17148268



  5. Decoding on the test set. The test set <inputs, targets> pairs were created in the same way as described above, i.e. the model is only asked to generate the assistant's utterances given the dialog context so far; a sketch of how I write the decode and reference files is included below.

BEAM_SIZE=4
ALPHA=0.6

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$OUTDIR \
  --t2t_usr_dir=$T2T_USR_DIR/taskmaster/trainer \
  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
  --decode_from_file=$DECODE_FILE


and then I use the moses-multi-bleu script (https://github.com/google/seq2seq/blob/master/seq2seq/metrics/bleu.py) to compute the BLEU score = 4.34.

Question 5 - Is the above training and eval configuration correct for reproducing the results?
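
For reference, this is a sketch of how I write $DECODE_FILE and the reference file used for scoring. The file names are placeholders for my actual paths, test_dialogs is assumed to be the loaded Taskmaster-1 test split, and make_pairs is the helper sketched in step 1:

```python
# Build <inputs, targets> pairs for every test dialog, exactly as in training.
test_pairs = []
for dialog in test_dialogs:  # list of (speaker, utterance) tuples per dialog
    test_pairs.extend(make_pairs(dialog))

with open("decode_inputs.txt", "w") as fin, open("test_targets.txt", "w") as fref:
    for inputs, target in test_pairs:
        fin.write(inputs + "\n")   # fed to t2t-decoder via --decode_from_file
        fref.write(target + "\n")  # used as the reference for BLEU scoring
```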


  6. However, if I use the t2t-bleu script to get the BLEU score, I get BLEU_uncased = 5.99 and BLEU_cased = 5.55.

Question 6 - Why is there a difference between the BLEU scores from these two scripts, and which one is correct?
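
For completeness, this is roughly how I call the moses-multi-bleu scorer on the decoder output; the file names are placeholders for my actual decode output and the reference file from step 5:

```python
import numpy as np

# moses_multi_bleu comes from the google/seq2seq repo linked above and wraps
# Moses' multi-bleu.perl script.
from seq2seq.metrics.bleu import moses_multi_bleu

with open("decode_output.txt") as f:
    hypotheses = np.array([line.strip() for line in f])
with open("test_targets.txt") as f:
    references = np.array([line.strip() for line in f])

# lowercase=False keeps the cased score; this is the run that gives 4.34 for me.
print(moses_multi_bleu(hypotheses, references, lowercase=False))
```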

Please let me know if you need additional information. Thanks

jatinganhotra · Feb 06 '20