
Cannot reproduce results in the paper

Waterpine opened this issue 6 years ago • 10 comments

I have fixed some small bugs mentioned in a previous issue and run the program, but I cannot get the results reported in the paper. The output of the code is not averaged over 10 folds (the paper says results are averaged over 10-fold cross-validation), so I averaged it myself to get the 10-fold cross-validation accuracy (the best validation model of each fold is chosen for testing). I get 47.74% for ENZYMES and 67.37% for DD, which does not match the paper. Can you let us know how to reproduce the results from the paper?
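To be concrete, here is roughly how I computed those numbers (just a sketch; the arrays are placeholders for the per-epoch accuracies I logged for each fold, not names from this repo):

import numpy as np

# Placeholder arrays standing in for the per-epoch accuracies logged for each
# of the 10 folds (10 folds x num_epochs); replace with the real logged values.
val_acc = np.random.rand(10, 300)
test_acc = np.random.rand(10, 300)

# For each fold, take the epoch with the best validation accuracy, record the
# test accuracy at that epoch, then average over the 10 folds.
best_epoch = val_acc.argmax(axis=1)
per_fold_test = test_acc[np.arange(10), best_epoch]
print("10-fold CV test accuracy: %.4f" % per_fold_test.mean())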

Waterpine avatar Feb 11 '19 03:02 Waterpine

Hi,

Although there is some variation in results, 47.74% for ENZYMES seems way too low. I am not sure what happened in your run; I retried and got the results that I reported, without any tuning. You can use any hidden-dim and output-dim between 30 and 64, assign-ratio of 0.25 or 0.1, etc., and optionally add --linkpred and --dropout; with all of these options you should be able to get 60%+.

Also, the main function in train.py calls benchmark_task_val, as described in the paper.

Rex

RexYing avatar Feb 14 '19 21:02 RexYing

Thanks for your reply! I have run the main function in train.py, benchmark_task_val(), but the max validation performance over all training iterations (there is no test-set performance) is 43.55% for ENZYMES and 78.02% for DD, with the hyperparameters set as in the source code. Could you provide a script (including hyperparameters) that reproduces the results in the paper? Thanks!

Waterpine avatar Feb 18 '19 02:02 Waterpine

python -m train --bmname=ENZYMES --assign-ratio=0.1 --hidden-dim=30 --output-dim=30 --cuda=1 --num-classes=6 --method=soft-assign

Got 63.7%

Many other configs are possible.
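If it helps, a quick sweep over a few such configs could be scripted like this (just a sketch; the flags are the ones from the command above, but the specific values are illustrative, not a tuned grid):

import itertools
import subprocess

# Illustrative sweep over a few hyperparameter combinations.
hidden_dims = [30, 64]
assign_ratios = [0.1, 0.25]

for hidden, ratio in itertools.product(hidden_dims, assign_ratios):
    cmd = [
        "python", "-m", "train",
        "--bmname=ENZYMES",
        "--assign-ratio=%s" % ratio,
        "--hidden-dim=%d" % hidden,
        "--output-dim=%d" % hidden,
        "--cuda=1",
        "--num-classes=6",
        "--method=soft-assign",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)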

RexYing avatar Feb 22 '19 00:02 RexYing

Thanks for your reply! I have modified the hyperparameters and run the main function in train.py, benchmark_task_val(), several times. The results for DD and ENZYMES have improved, but DD is at 79.52% and ENZYMES at 56.78%, which are still lower than the results reported in the paper. Also, I think choosing the max validation performance over all training iterations as the evaluation method is incorrect.

Waterpine avatar Mar 03 '19 05:03 Waterpine

As I said, I'm confused about what has been tuned. The command I posted gave much higher results, as mentioned; ENZYMES gets 60%+ even without any tuning. In general you don't even need to tune to get the results. You could also try the DiffPool example in https://github.com/rusty1s/pytorch_geometric/tree/master/examples; it should give similar results.

The validation-accuracy metric was consistent across all experiments, and has also been adopted by GIN, etc. This is mainly due to the small size of some of the datasets. You can of course run other test-accuracy experiments; you just need to make sure you are consistent in the evaluation.

RexYing avatar Mar 07 '19 21:03 RexYing

Hi @RexYing. I love your paper and I think this is a really cool method; I just wanted to ask about how you measure performance for benchmarking.

Could you please clarify the process used here? As far as I can see, it goes as follows:

  • 10-fold cross-validation
  • for each fold, record the best validation score
  • keep the best validation accuracy from each fold and report the mean of these (although in the code it actually looks like the max of the mean val acc across folds is reported?)

Is my understanding correct, or do you also use a separate test set for each fold based on the val scores?

meltzerpete avatar Aug 12 '19 13:08 meltzerpete

Hi, your understanding is correct. The max of the mean is used, and I didn't specify a test set in the code. Maybe refer to #17 for a bit more detail?
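To make the aggregation concrete, here is a sketch of the difference (val_acc is a hypothetical folds-by-epochs array of logged validation accuracies, not a variable from the repo):

import numpy as np

# Hypothetical logged validation accuracies: 10 folds x num_epochs.
val_acc = np.random.rand(10, 300)  # placeholder for real logged values

# Metric used here: average the validation accuracy across folds at each
# epoch, then take the max of that averaged curve over epochs.
max_of_mean = val_acc.mean(axis=0).max()

# Alternative aggregation: take each fold's best epoch first, then average.
mean_of_max = val_acc.max(axis=1).mean()

print("max of mean: %.4f / mean of per-fold max: %.4f" % (max_of_mean, mean_of_max))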

RexYing avatar Aug 13 '19 14:08 RexYing

Thanks, and sorry I did not see #17 - this has answered my question exactly!

meltzerpete avatar Aug 13 '19 20:08 meltzerpete

Hi @RexYing, I am trying to run your code with the script provided in example.sh, but like the OP I get results that do not match the paper: sometimes 0.48, sometimes 0.56 test accuracy. (I am running benchmark_task rather than benchmark_task_val so that I can see the test accuracy; there is no test set in benchmark_task_val.) Do you have a way to solve this problem?

Livetrack avatar Aug 15 '19 21:08 Livetrack

Hi, the accuracy reported is the mean of the validation accuracy over the 10 cross-validation runs. All baselines were run with a hyperparameter search and evaluated with the same metric.

RexYing avatar Aug 16 '19 12:08 RexYing