
reproducing results paper

Open koenvanderveen opened this issue 6 years ago • 14 comments

Hi! I was playing with your code, great work! I am trying to reproduce the results from your paper on WikiSQL. However, when using run.sh I get results in the 70.3% ballpark on the dev set instead of the reported 72.2%. Are there any parameters I need to change to get the reported results?

Thanks in advance!

koenvanderveen avatar Dec 20 '18 13:12 koenvanderveen

Thanks for asking the question. The result in the paper was obtained using the default parameters in the repo on an AWS g3.xlarge machine.

There are three sources of difference between experiments (and the sensitivity of RL training tends to amplify them): (1) the randomness of the seed; (2) the stochasticity of asynchronous training; (3) differences in machine configuration. In my experience, even instances of the same type can sometimes behave differently because of the cloud environment.

But the difference you saw is larger than the standard deviation in my experiments, so I would also like to investigate it.

I am working on an update to fix (1) and (2) and make experiments more deterministic. For (3), may I know the machine configuration you are using?

In the README, I attached a picture of the learning curve of one run that reached 72.35% dev accuracy on WikiSQL. If it helps, I can also share the full tensorboard log and the saved best model from a more recent experiment.

crazydonkey200 avatar Dec 21 '18 16:12 crazydonkey200

Thanks for your quick response! I used an AWS g3.xlarge. I tried multiple times, but I consistently get results around 70.3.

koenvanderveen avatar Dec 21 '18 17:12 koenvanderveen

Thanks for the input. I will try starting some new AWS instances to see if I can replicate the issue. In the meantime, here's a link to the data of a recent run that reached 72.2% dev accuracy. The tensorboard log is in the tb_log subfolder, and the best model is saved in the best_model subfolder.

crazydonkey200 avatar Dec 22 '18 16:12 crazydonkey200

Thanks, I'd love to find out where the difference comes from. I downloaded the repo again to make sure I had not made any changes and ran it again, but reached the same result. The only thing I had to change to make it work was replacing (line 70 of table/utils.py):

try: 
  val = babel.numbers.parse_decimal(val) 
except (babel.numbers.NumberFormatError): 
  val = val.lower() 

with

try: 
  val = babel.numbers.parse_decimal(val) 
except (babel.numbers.NumberFormatError, UnicodeEncodeError): 
  val = val.lower() 

This was because of errors like: UnicodeEncodeError: 'decimal' codec can't encode character u'\u2013' in position 1: invalid decimal Unicode string

Do you think that might be the reason? And if so, do you have any idea how to avoid having to catch those errors?
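
One direction I considered (an untested sketch, with made-up names, not what is in the repo) is normalizing the unicode punctuation before parsing, so these values never reach the failing code path:

import babel.numbers

# Illustrative only: map common unicode punctuation to ASCII before handing
# the value to babel. The broadened except clause stays as a safety net.
REPLACEMENTS = {
    u'\u2013': u'-',   # en dash
    u'\u2014': u'-',   # em dash
    u'\u00a0': u' ',   # non-breaking space
}

def normalize_number_string(val):
  for old, new in REPLACEMENTS.items():
    val = val.replace(old, new)
  return val

def parse_value(val):
  try:
    val = babel.numbers.parse_decimal(normalize_number_string(val))
  except (babel.numbers.NumberFormatError, UnicodeEncodeError):
    val = val.lower()
  return val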

koenvanderveen avatar Dec 29 '18 16:12 koenvanderveen

Sorry for the late reply. I have added your change to the codebase and rerun the experiments on two new AWS instances. The mean and std from 3 experiments (each averaging 5 runs) are 71.92+-0.21%, 71.97+-0.17%, and 71.93+-0.38%. You can also download all the data for these 3 experiments here 1 2 3.

I am also curious about the reason for the difference. I have added a new branch named fix_randomization to make the results more reproducible by controlling the random seeds. Would you like to try running the experiments again using the new branch on an AWS instance and let me know if anything changes?

Thanks.

crazydonkey200 avatar Jan 09 '19 11:01 crazydonkey200

Hi! I ran the experiments again on the fix_randomization branch, but the results did not change (still around 70%). Did you re-download the data before running the experiments? I cannot think of any other source of randomness at this point, but the difference is quite consistent.

koenvanderveen avatar Jan 28 '19 15:01 koenvanderveen

OK, I finally found the source of the difference. I had used a newer version of the Deep Learning AMI on AWS; running the experiments with v10 now gives the same results (around 71.7%). It would be interesting to know which operations changed.

koenvanderveen avatar Jan 30 '19 10:01 koenvanderveen

Thanks for reporting this and for running the experiments to confirm it!

That's interesting; I would also like to look into this. What is the newer version of the Deep Learning AMI you used? Is it Deep Learning AMI (Ubuntu) Version 21.0 - ami-0b294f219d14e6a82? And how do you launch instances with previous versions, for example v10? Thanks!

crazydonkey200 avatar Jan 31 '19 09:01 crazydonkey200

Hi there :-)

I'm trying to replicate the results on WikiTableQuestions. I tried TensorFlow v1.12.0 (Deep Learning AMI 21.0) and v1.8.0 (Deep Learning AMI 10.0). The corresponding accuracies are 41.12% for v1.12.0 and 43.27% for v1.8.0. It looks like the difference comes from the TensorFlow version.

Also, are the current settings in run.sh the ones used to produce the learning curve in the image? The number of steps is set to 25,000, while in the picture it is around 30,000. In addition, max_n_mem was set to 60, which caused the Not enough memory slots for example... warning. I changed it to 100, but I'm not sure if that is the right thing to do? Thanks!

dungtn avatar Feb 02 '19 00:02 dungtn

Hi, thanks for the information :) I will run some experiments to compare TF v1.12.0 vs v1.8.0.

The current settings in run.sh are the ones used to produce the result in the paper. The image was produced from an old setting that trains for 30k steps. Thanks for pointing it out; I will replace the image with a run under the current setting.

max_n_mem was set to 60 for the sake of speed. When a table is large and requires more than 60 memory slots, some columns are dropped (which is the reason for the Not enough memory slots for example... warning). Changing it to 100 would probably achieve a similar or better result because no columns are dropped, but training will be slower.
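
To illustrate the trade-off, here is a purely illustrative sketch, not the repo's actual code (select_columns and n_reserved_slots are made-up names):

# Illustrative only: with a fixed memory budget, columns beyond the budget
# are dropped and a warning is printed; a larger max_n_mem keeps more
# columns at the cost of slower training.
def select_columns(columns, max_n_mem, n_reserved_slots=20):
  budget = max_n_mem - n_reserved_slots
  if len(columns) > budget:
    print('Not enough memory slots for example, dropping {} columns.'.format(
        len(columns) - budget))
    return columns[:budget]
  return columns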

crazydonkey200 avatar Feb 02 '19 08:02 crazydonkey200

As an update, I have created a branch reproducible that can run training deterministically. Because it is hard to make TensorFlow deterministic when using a GPU (see here for more info) and when running with multiprocessing, this branch uses only 1 trainer and 1 actor for training, so training is very slow (it takes about 44 hours to finish one training run, versus 2-3 hours on the master branch). This branch uses tensorflow-gpu==1.12.0.

This setting gets slightly lower results on WikiTable (41.51+-0.19% dev accuracy, 42.78+-0.77% test accuracy). Below are the commands to reproduce the experiments (after pulling the latest version of the repo):

git checkout reproducible
cd ~/projects/neural-symbolic-machines/table/wtq/
./run_experiments.sh run_rpd.sh mapo mapo_rpd
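
For reference, this is roughly the kind of setup involved in making a single-process run deterministic (a minimal sketch assuming TensorFlow 1.12 and numpy; the actual branch may differ):

# Minimal sketch of a deterministic single-process setup.
import random
import numpy as np
import tensorflow as tf

SEED = 0
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)

# Single-threaded execution avoids nondeterministic reduction order on CPU.
config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)
sess = tf.Session(config=config)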

crazydonkey200 avatar Mar 08 '19 07:03 crazydonkey200

Can you add more details about dataset preprocessing? For example, how to generate the all_train_saved_programs.json file?

dungtn avatar Mar 29 '19 21:03 dungtn

Where do you get stop_words.json?

guotong1988 avatar May 30 '19 12:05 guotong1988

@dungtn Here's a detailed summary created by another researcher on how to replicate the preprocessing and experiments starting from the raw WikiTableQuestions dataset, and how to adapt the code to other similar datasets. I also added a link to this summary to the README.

@guotong1988 Unfortunately I don't remember exactly where I got the stop_words.json list, but it seems to be a subset of the nltk stop words, for example found here.
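
If you just need a similar list, one way to regenerate it is sketched below (assumes nltk is installed and its stopwords corpus has been downloaded; the resulting list may not match the repo's file exactly):

# Regenerate a stop word list from nltk and save it as JSON.
import json
from nltk.corpus import stopwords

words = sorted(set(stopwords.words('english')))
with open('stop_words.json', 'w') as f:
  json.dump(words, f, indent=2)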

crazydonkey200 avatar Mar 31 '21 01:03 crazydonkey200

Hello, when I run the code, I find that it stops at a random time. After checking, I think it is due to the multiprocessing of the actors: the actor processes stop at random times. Can you help me? Thank you.

qishi21 avatar Nov 17 '22 03:11 qishi21