Unable to reproduce published results on DailyDialog
Hi, I am retraining your model as a baseline. So far, SWDA gave results in line with the paper, actually slightly better. But for the DailyDialog dataset, even after multiple runs, the best we got is the following (row 1 is on the validation set, row 2 on the test set):
A, E, G are for sim_bow.

| BLEU-R | BLEU-P | F1 | A | E | G |
|---|---|---|---|---|---|
| 0.305 | 0.170 | 0.218 | 0.940 | 0.609 | 0.857 |
| 0.298 | 0.163 | 0.211 | 0.940 | 0.605 | 0.857 |
Whereas the paper mentions the best results to be

Were there any changes made to the code relative to the configuration in the paper? I couldn't find any discrepancy. Can you point me to what might be the issue?
Thanks for pointing this out. There seems to have been a big deviation from the original results recently. Somebody reported better results than those in the paper for the DailyDialog dataset. We are not sure whether it is due to a change in the environment beyond what is listed in "requirements.txt". We are looking into it and will let you know.
Ok, thanks. We were using an environment set up strictly as per requirements.txt, though. Also, as you said, we noticed quite a bit of variance between runs, even when the seed is given as an argument.
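For anyone comparing runs, here is a minimal sketch of seeding every random number generator in one place. It assumes the training code draws on Python's `random`, NumPy, and PyTorch; that is an assumption, since the thread does not say which libraries are involved. The helper name `seed_everything` is hypothetical and not part of the repository.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the common RNGs (hypothetical helper, not from the repo)."""
    # Only affects hash randomization in Python processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    # NumPy and PyTorch are optional here; they are seeded only if installed.
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
        # Ask cuDNN for deterministic kernels and disable its autotuner.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


seed_everything(1234)
```

Even with all of this set, some GPU operations remain nondeterministic, so a fixed seed can narrow the run-to-run variance without removing it.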
Also, #2341 showed that the NLTK library had some issues related to SmoothingFunction() and therefore received an update to fix them. Hence, it is no longer possible to achieve the same results.
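Since the BLEU numbers depend on the NLTK release, it may help to record the installed versions next to each result. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+). The helper name `installed_versions` and the package list are illustrative assumptions, not taken from the repository's requirements.txt.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Illustrative package list; adjust it to whatever requirements.txt pins.
for name, version in installed_versions(["nltk", "numpy", "torch"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Comparing this output with the versions pinned in requirements.txt shows whether the environment is using an NLTK release from before or after the smoothing fix.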