
Reproducibility of Table 3

Open · SilvioGiancola opened this issue 5 years ago · 9 comments

Hi,

I have been running your code several times on the DD dataset and obtain different results from the ones you report in Table 3 of your paper. In particular, I ran this experiment 20 times and estimated the average and std, and found that the training is very random, with the std rising to ±10%. Note that I used the same hyperparameters you provide in your paper and your Google sheet (see issue #2). I also tried the ReduceLROnPlateau scheduler for the LR, but still get an std of up to 5%. How did you select your seed, and how come there is such variation?

Thank you for your support,

Best,

SilvioGiancola · Jul 17 '19

I can confirm @SilvioGiancola's observation: using the suggested hyperparameters on the D&D dataset, the variance is much larger than the results on the Google sheet suggest, and the mean accuracy is not that high either.

ThyrixYang · Oct 14 '19

Hi @SilvioGiancola, they state in their paper that:

In our experiments, we evaluated the pooling methods over 20 random seeds using 10-fold cross validation. A total of 200 testing results were used to obtain the final accuracy of each method on each dataset

So we evaluate by running the CV (the main.py file) 10 times, calculating the mean of those runs, taking this as one result, and repeating this procedure 20 times. I hope this helps.
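
For concreteness, that aggregation amounts to something like the sketch below; run_main_py is a hypothetical wrapper around one execution of main.py that returns its test accuracy:

    import numpy as np

    def run_main_py(seed):
        # Hypothetical wrapper: one execution of main.py with the given seed,
        # returning its test accuracy (main.py itself does a random split).
        raise NotImplementedError

    results = []
    for outer in range(20):                       # 20 repetitions
        accs = [run_main_py(seed=outer * 10 + k)  # 10 executions of main.py
                for k in range(10)]
        results.append(np.mean(accs))             # mean of 10 runs = one result

    # Reported number: mean +/- std over the 20 results (200 runs in total).
    print(f"{np.mean(results):.2f} +/- {np.std(results):.2f}")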

ThyrixYang · Oct 15 '19

Hi @ThyrixYang , thank you for sharing this detail!

In that case, it is not exactly the same as running the code 10 times, since main.py does random splitting rather than 10-fold cross-validation.

With this averaging over 10 runs, did you get a variance similar to the one in Table 3?

SilvioGiancola · Oct 20 '19

@SilvioGiancola Yes, the variance is similar, although the mean is a bit lower.

ThyrixYang · Oct 20 '19

@ThyrixYang I solved the variance issue with this 10-fold cross-validation.

However, when I reproduce their results, I get 10% lower than what they claim on the DD dataset using the global pooling model. Are you also seeing such a big difference in your results?

I wish the authors could provide code to reproduce their results. It is impossible to build upon them otherwise...

SilvioGiancola · Oct 22 '19

@SilvioGiancola Are you doing exactly 10-fold CV? I remember that with the random CV it was about 2~3% lower, not 10%. I'm working on a paper about graph pooling now, though not based mainly on this paper; I will do more experiments later, and maybe we can share some results then.

ThyrixYang · Oct 23 '19

@ThyrixYang I'll be more than happy to share some results with you on this baseline. I believe I am performing the 10-fold CV properly; in particular (a sketch follows the list below):

  1. I randomly split the dataset (DD in my case) into 10 folds of the same length (the last fold has a slightly different length)
  2. I use 9 folds for training and the 10th for testing
  3. I repeat the 9-fold training 10 times, holding out a different fold for testing each time
  4. I average the testing performance over the 10 folds -> this gives me the result for 1 run
  5. I repeat steps 1-4 20 times, with 20 different random splits for the folds
  6. I estimate the average and the std over the 20 runs
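
Roughly, this protocol looks like the sketch below; train_and_eval is a placeholder for training SAGPool on 9 folds and returning the test accuracy on the held-out fold:

    import numpy as np
    from sklearn.model_selection import KFold

    def train_and_eval(train_idx, test_idx):
        # Placeholder: train on the graphs indexed by train_idx and
        # return the test accuracy on the graphs indexed by test_idx.
        raise NotImplementedError

    def one_run(num_graphs, seed):
        kf = KFold(n_splits=10, shuffle=True, random_state=seed)   # step 1
        fold_accs = [train_and_eval(train_idx, test_idx)           # steps 2-3
                     for train_idx, test_idx in kf.split(np.arange(num_graphs))]
        return np.mean(fold_accs)                                  # step 4

    runs = [one_run(num_graphs=1178, seed=s) for s in range(20)]   # step 5 (DD has 1178 graphs)
    print(f"{np.mean(runs):.2f} +/- {np.std(runs):.2f}")           # step 6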

I get 65.1 ± 1.23 on DD using the global pooling setting, while the paper claims 76.19 ± 0.94. By the way, I tried both the SAGPool implementation from this repo and the one from pytorch-geometric, with similar results. I also used the hyperparameters from the gsheet (lr=0.005, nhid=128, weight decay=0.00001).

Are you doing anything differently for the 10-fold CV? Have you tried the same dataset or a different one?

SilvioGiancola · Oct 23 '19

(quoting @SilvioGiancola's 10-fold CV procedure and results above)

Hi, I tried to reproduce the experiment too. How is your replication of the experiment going now? I am very interested in it and am looking forward to your reply. Thank you!

jiaruHithub · Jun 19 '20

Hi, each time I run this code I get different results, even though I have set these seeds:

import random
import numpy as np
import torch

torch.cuda.manual_seed(12345)      # seed the current GPU's RNG
torch.cuda.manual_seed_all(12345)  # seed all GPUs' RNGs
random.seed(12345)                 # seed Python's built-in RNG
np.random.seed(12345)              # seed NumPy's RNG
#torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

What can I do to get the same result each time I run this code? I am looking forward to your reply. Thank you!
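
For reference, a fuller seeding setup would typically also seed torch's CPU RNG and enable deterministic cuDNN kernels; a minimal sketch (not the repo's code):

    import random
    import numpy as np
    import torch

    def set_all_seeds(seed=12345):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)                     # seeds the CPU RNG (and, in recent PyTorch, all GPU RNGs)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True   # use deterministic cuDNN kernels
        torch.backends.cudnn.benchmark = False      # disable non-deterministic autotuning

Even with all seeds fixed, some GPU scatter operations used by PyTorch Geometric can be non-deterministic, so small run-to-run differences may remain.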

Abelpzx · Jul 31 '21