selfmonitoring-agent

Potential reproducibility issue with PyTorch >1.0.0

Open chihyaoma opened this issue 4 years ago • 3 comments

Hi all,

Thank you so much for your interest in the project and the released code.

When we released the code, we made sure that it could robustly reproduce the numbers we reported in the paper. Since then, I have confirmed with several people who tried the code that they can also reproduce the results.

However, since the 2nd week of September, I have started to receive a few emails from people reporting that they cannot reproduce the results for either the Self-Monitoring agent or the Regretful agent.

I decided to create this issue now so that people who are interested in the proposed method can run the code and continue their research with caution. Currently, I suspect this issue is due to version differences in PyTorch (or in other Python/CUDA libraries that I am using) that cause unexpected behavior.

Given the current conference deadlines, I expect to be able to start investigating this issue as early as the winter break (end of December).


Below are the experimental setups that I used for developing and releasing the code. I hope they help in reproducing the results.

Code development:
PyTorch: 0.4.1, CUDA: 9.2.148, cuDNN: 7104

I also tested the code on the following setup and made sure it could reproduce the results at release time:
PyTorch: 1.0.0, CUDA: 10.0.130, cuDNN: 7401
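
To compare a local installation against the two setups listed above, the relevant versions can be printed from Python. This is only a minimal sketch: `describe_environment` is a hypothetical helper and not part of the released code. It relies on the standard `torch.__version__`, `torch.version.cuda`, and `torch.backends.cudnn.version()` attributes, and reports `None` where PyTorch, CUDA, or cuDNN is unavailable.

```python
import platform


def describe_environment():
    """Collect the versions relevant to reproducing the results (hypothetical helper)."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this interpreter
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None for CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # an integer such as 7401, or None
    return info


if __name__ == "__main__":
    for name, version in describe_environment().items():
        print(f"{name}: {version}")
```

Including this output in a report makes it easier to tell whether a gap in val_unseen success rate coincides with a version mismatch.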

chihyaoma, Nov 04 '19 05:11

With CUDA 10.0.130, cuDNN v7.6, and PyTorch 1.1.0, I trained the model on real data using the code provided in the README. The highest val_unseen success rate I get is only 0.438. Do you know where the problem might lie?

ZhangTianrong, Apr 10 '20 11:04

I can confirm that one needs to run PyTorch 1.0.0 (tested with Python 3.6 + CUDA 10.0.130 + cuDNN 7401) and not PyTorch 1.6.0 (+ CUDA 10.0.130). It might be related to how PyTorch introduced torch.bool.
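
As a hedged illustration of how the torch.bool change could alter results silently (not a confirmed diagnosis for this repository, whose masking code was not checked here): starting with PyTorch 1.2, comparison operators return `torch.bool` tensors, and `~` on a `torch.uint8` tensor performs a bitwise NOT instead of computing `1 - mask`. The helper `invert_mask_demo` below is hypothetical and needs PyTorch 1.2 or later to run.

```python
try:
    import torch
except ImportError:  # keep the snippet importable without PyTorch
    torch = None


def invert_mask_demo():
    """Invert the same mask stored as uint8 and as bool (hypothetical helper)."""
    byte_mask = torch.tensor([1, 0, 1], dtype=torch.uint8)
    bool_mask = byte_mask.bool()  # Tensor.bool() exists from PyTorch 1.2 on
    # On PyTorch >= 1.2, ~ on uint8 is a bitwise NOT, so 1 -> 254 and 0 -> 255.
    # On bool tensors, ~ is a logical NOT.
    return (~byte_mask).tolist(), (~bool_mask).tolist()


if __name__ == "__main__" and torch is not None and hasattr(torch, "bool"):
    byte_inverted, bool_inverted = invert_mask_demo()
    print("uint8:", byte_inverted)  # [254, 255, 254] on PyTorch >= 1.2
    print("bool: ", bool_inverted)  # [False, True, False]
```

Under this assumption, code that built a `uint8` mask and inverted it with `~` would get every entry nonzero on newer versions, which raises no error but changes which positions are masked. Searching the code for `~`, `1 - mask`, and explicit `uint8`/`.byte()` masks would be a reasonable first check.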

guhur, Sep 08 '20 18:09

> I can confirm that one needs to run PyTorch 1.0.0 (tested with python 3.6 + CUDA 10.0.130 + cuDNN 7401) and not PyTorch 1.6.0 (+ CUDA 10.0.130). It might be related to how PyTorch introduced torch.bool.

Is there any explanation of, or related post about, this torch.bool problem? I could not find any useful information on it, but I have encountered a similar issue.

PatZhuang, Nov 27 '20 13:11