DST-as-Prompting
On reproducing the experimental results in the paper
Hi,
Congratulations on being accepted to EMNLP 2021; it is a concise and solid piece of work! I am currently following your research and trying to reproduce the experimental results from the original paper using your code. However, I have had trouble matching the reported JGA scores.
My experiments were all run on MultiWOZ v2.2 with domain and slot descriptions. Here are my hyperparameter settings and the corresponding results:
- T5-small, lr=5e-5, n_epoch=3, batch size=8, JGA=55.3
- T5-base, lr=5e-5, n_epoch=3, batch size=8, JGA=56.0
- T5-base, lr=5e-4, n_epoch=2, batch size=16 (per-device batch size 8 with gradient accumulation 2), JGA=56.1
- T5-base, lr=5e-4, n_epoch=2, batch size=64 (per-device batch size 8 with gradient accumulation 8) [same as the paper], JGA=56.2

The experiments were run on a single A100 40GB with Python==3.9.12, PyTorch==1.12.1, and CUDA==11.6; all other hyperparameters were left at their defaults. There is still a gap between my results and the JGA reported in the paper, which is 57.6.
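For concreteness, here is a minimal sketch of how I map the last configuration above onto a standard HuggingFace `Seq2SeqTrainingArguments`; the output directory is a placeholder, and your actual training script may wire things up differently:

```python
# Illustrative sketch only: the last configuration above expressed with a
# standard HuggingFace Seq2SeqTrainingArguments. The output directory is a
# placeholder; the repo's real training script may differ.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Seq2SeqTrainingArguments

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

training_args = Seq2SeqTrainingArguments(
    output_dir="t5_base_multiwoz22",   # placeholder path
    learning_rate=5e-4,
    num_train_epochs=2,
    per_device_train_batch_size=8,     # physical batch size on the single A100
    gradient_accumulation_steps=8,     # effective batch size = 8 * 8 = 64
    predict_with_generate=True,        # generate dialogue states for evaluation
)
```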
I am wondering if there are any other tricks to achieve better results. If so, would it be okay to share them? Much appreciated! Looking forward to your reply :-D
Best
Hi, thanks for your interest! My best guess is that this is an optimization difference between training on multiple machines and accumulating gradients on a single machine. For T5-base we trained on multiple GPUs, and I honestly can't remember the exact configs we used.
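To spell out the distinction: with a per-device batch size of 8, accumulating gradients over 8 steps on one GPU and taking a single step across 8 data-parallel GPUs give the same nominal effective batch size, but the data sharding, loss averaging over padded micro-batches, and random seeds all differ, so the two runs are not bit-for-bit equivalent. A tiny sanity check of the arithmetic (the GPU count below is a guess, not our actual config):

```python
# Rough arithmetic only; the multi-GPU count is hypothetical, not our actual setup.
def effective_batch_size(n_gpus: int, per_device_batch_size: int, grad_accum_steps: int) -> int:
    """Number of examples contributing to one optimizer step."""
    return n_gpus * per_device_batch_size * grad_accum_steps

single_gpu_with_accumulation = effective_batch_size(1, 8, 8)  # your single-A100 run
multi_gpu_data_parallel = effective_batch_size(8, 8, 1)       # hypothetical 8-GPU run

print(single_gpu_with_accumulation, multi_gpu_data_parallel)  # 64 64
```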