Cannot reproduce results in the paper
I have fixed some small bugs mentioned in the previous issue and run the program. However, I cannot get the results reported in the paper. The results produced by running the code are not averaged over 10 folds (the paper says results are averaged over 10-fold cross-validation), so I averaged them myself to get the 10-fold cross-validation accuracy (the best validation model of each fold is used for test), which is 47.74% for ENZYMES and 67.37% for DD. This does not match the results in the paper. Could you let us know how to reproduce the results from the paper?
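For reference, here is a minimal sketch of the averaging I describe above, assuming per-fold validation and test accuracy histories are available (the array names are illustrative, not taken from the diffpool code):

```python
import numpy as np

# Hypothetical per-fold histories (names are illustrative, not from the repo):
# val_acc[f][e] and test_acc[f][e] are the validation / test accuracy of
# fold f at epoch e.
def mean_of_per_fold_best(val_acc, test_acc):
    fold_scores = []
    for fold_val, fold_test in zip(val_acc, test_acc):
        best_epoch = int(np.argmax(fold_val))      # epoch with the best val acc
        fold_scores.append(fold_test[best_epoch])  # test acc of that checkpoint
    return float(np.mean(fold_scores))             # average over the 10 folds
```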
Hi,
Although there is some variation in results, 47.74% for ENZYMES seems way too low. I'm not sure what happened in your run, as I retried and got the results that I reported, without any tuning. You can use any hidden-dim and output-dim between 30 and 64, assign-ratio=0.25 or 0.1, etc., and optionally add --linkpred and --dropout; you should be able to get 60%+ with all these options.
Also, the main function in train.py calls benchmark_task_val, as described in the paper.
Rex
Thanks for your reply! I have run the main function in train.py, benchmark_task_val(), but the max validation performance (there is no test-set performance) over all training iterations is 43.55% for ENZYMES and 78.02% for DD. The hyper-parameters are the ones in the source code. Could you provide a script (including hyper-parameters) that reproduces the results in the paper? Thanks!
python -m train --bmname=ENZYMES --assign-ratio=0.1 --hidden-dim=30 --output-dim=30 --cuda=1 --num-classes=6 --method=soft-assign
Got 63.7%
many other configs are possible
Thanks for your reply! I have modified the hyper-parameters and run the main function in train.py, benchmark_task_val(), several times. The results for DD and ENZYMES have improved. However, the result for DD is 79.52% and the result for ENZYMES is 56.78%, which are lower than the results you report in the paper. What's more, I think choosing the max validation performance over all training iterations as the evaluation method is incorrect.
As I said, I'm confused about what has been tuned. The command I posted gave much higher results, as mentioned. ENZYMES gets 60+ even without any tuning; in general you don't even need to tune to get the results. Maybe you can try the diffpool example at https://github.com/rusty1s/pytorch_geometric/tree/master/examples; it should give similar results.
The val acc metric was consistent across all experiments, and has been adopted by GIN etc. This is mainly due to the small dataset size of some of the datasets. You can of course run other test-acc experiments, but you just need to make sure you are consistent in the evaluation.
Hi @RexYing. I love your paper; I think this is a really cool method. I just wanted to ask about how you measure performance for benchmarking.
Could you please clarify the process used here? As far as I can see, it goes as follows:
- 10-fold cross-validation
- for each fold, record the best validation score
- keep the best validation accuracy from each fold and report the mean of these (although in the code it actually looks like the max of the mean val acc across folds is reported?)
Is my understanding correct, or do you also use a separate test set for each fold, selected based on the val scores?
Hi, your understanding is correct. The max of the mean is used, and I didn't specify a test set in the code. Maybe refer to #17 for a bit more detail?
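For concreteness, here is a minimal sketch of the max-of-the-mean evaluation described above, assuming the per-epoch validation accuracy of each fold has been collected (the array layout is an assumption, not the actual data structure in train.py):

```python
import numpy as np

# Hypothetical layout (an assumption, not the actual structure in train.py):
# val_acc[f][e] is the validation accuracy of fold f at epoch e, with the same
# number of epochs in every fold.
def max_of_mean_val_acc(val_acc):
    val_acc = np.asarray(val_acc)            # shape: (num_folds, num_epochs)
    mean_per_epoch = val_acc.mean(axis=0)    # average the folds at each epoch
    return float(mean_per_epoch.max())       # best epoch of the averaged curve
```

Because the max is taken only after averaging across folds, this number can never exceed the mean of each fold's individual best validation accuracy, which is one reason the two protocols give different numbers.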
thanks, sorry I did not see #17 - this has answered my question exactly!
Hi @RexYing, I am trying to run your code with the script provided in example.sh, but like the OP I get results that do not match the paper. Sometimes I get 0.48 and sometimes 0.56 for the test accuracy (I am running benchmark_task and not benchmark_task_val so that I can see the test accuracy; there is no test set in benchmark_task_val). Do you have a way to solve this problem?
Hi, the accuracy reported is the mean of the validation accuracy over the 10 cross-validation runs. All baselines were run with a hyperparameter search and evaluated with the same metric as well.