
Which decoding method works best

cecilialeo77 opened this issue on May 16, 2024 · 3 comments

Regarding the decoding approach --decoding-strategy reparam-<...>-<topk_mode>-<...>: in your experiments, is the default decoding method necessarily worse than the one specified in the script? On the different datasets, which decoding method's results did you choose as the final ones? Looking forward to your reply!

cecilialeo77 · May 16, 2024

Hey, thanks for reaching out! In our experiments, we found the default decoding strategy generally underperforms compared to our approach for all tasks discussed in our paper. We've reported results using our improved decoding strategy. You can find the specific --decoding-strategy parameters for each task at the following links:

  • Machine translation tasks: https://github.com/HKUNLP/reparam-discrete-diffusion/blob/26ee286b281edc6284d74f809465b3e6d42507a6/fairseq/experiments/mt_generate.sh#L31-L45
  • Other tasks: https://github.com/HKUNLP/reparam-discrete-diffusion/blob/26ee286b281edc6284d74f809465b3e6d42507a6/fairseq/experiments/diffuseq_generate.sh#L63-65

Feel free to check them out and let me know if you have any more questions!
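For a quick sense of where the flag goes, a generation call has roughly the following shape. This is just a sketch: the fairseq-generate entry point, data paths, and checkpoint shown here are placeholders, and the authoritative commands and argument lists are in the scripts linked above.

```bash
# Illustrative sketch only: the real entry point, data paths, and full set of
# arguments are in experiments/mt_generate.sh and experiments/diffuseq_generate.sh.
DATA_BIN=path/to/binarized/data        # placeholder
CHECKPOINT=path/to/checkpoint_best.pt  # placeholder

# Swap the value below for "default" to compare against the baseline decoding.
fairseq-generate "${DATA_BIN}" \
    --path "${CHECKPOINT}" \
    --decoding-strategy reparam-uncond-stochastic5.0-cosine
```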

LZhengisme · May 18, 2024

Thank you for your response! The reason I asked is that when reproducing your results on the QQP and QG tasks, I found that the BLEU score of the default decoding strategy exceeded that of the decoding strategy specified in the script. For example, on the QG task I reproduced the following:

  • NUM_ITER: 10, --decoding-strategy default: avg BLEU score 0.17457566633838953
  • NUM_ITER: 10, --decoding-strategy reparam-uncond-stochastic5.0-cosine: avg BLEU score 0.17439646335154702

Do you have any suggestions? Should we use the decoding strategy with the higher BLEU score as the final result?

cecilialeo77 · May 19, 2024

Thanks for the details! Yes, there might be a lot of variation for more open-ended generation scenarios, like the question generation task here. It’s not uncommon for the default decoding strategy to sometimes perform competitively on these tasks.

Given your findings, I'd suggest experimenting a bit more if you have time (e.g., replacing --argmax-decoding with a larger temperature, say --temperature 0.5, or tweaking the parameters in uncond-stochastic5.0-cosine) to see whether the default decoding strategy consistently delivers higher BLEU scores. If so, it makes sense to use default decoding as the final strategy for your scenario. But with such close scores (0.1746 vs. 0.1744), continuing with the reparam strategy could still be a good choice.
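If it helps, a small sweep along those lines could look roughly like the loop below. This is only a sketch: the way diffuseq_generate.sh is invoked and passed arguments here is a placeholder, so adapt it to however you actually run generation and your own avg-BLEU evaluation step.

```bash
# Rough sweep over decoding strategies and temperatures; the generation call
# below is a placeholder -- substitute the actual command you use (e.g. the one
# from experiments/diffuseq_generate.sh) and your own BLEU scoring step.
for STRATEGY in default reparam-uncond-stochastic5.0-cosine; do
  for TEMP in 0.1 0.5 1.0; do
    OUT="outputs/qg_${STRATEGY}_temp${TEMP}"
    mkdir -p "${OUT}"
    # Hypothetical invocation: sampling with --temperature instead of
    # --argmax-decoding, as suggested above.
    bash experiments/diffuseq_generate.sh \
        --decoding-strategy "${STRATEGY}" \
        --temperature "${TEMP}" \
        > "${OUT}/generate.log"
    echo "strategy=${STRATEGY} temperature=${TEMP} -> ${OUT}/generate.log"
    # Score ${OUT}/generate.log with the same avg-BLEU computation as before
    # and keep the configuration that is consistently best.
  done
done
```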

Hope this helps! 😊

LZhengisme · May 20, 2024