
Better mask_test_edges function

Open stefanosantaris opened this issue 5 years ago • 6 comments

@tkipf Here is my fix for a more performant mask_test_edges function. I managed to reproduce the results of the paper, and it works much better even on large graphs. The only time-consuming steps left in the function are the assertions at the end.
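A minimal sketch of the idea being discussed (this is not the repository's exact code: the signature, the `(E, 2)` edge-array input, and the helper names are assumptions). The main speed-up over the original comes from doing membership checks against a hash set instead of repeated row-by-row scans, and from shuffling the edge list once instead of drawing edges one at a time:

```python
import numpy as np

def mask_test_edges(edges, num_nodes, test_percent=10., val_percent=5., seed=0):
    """Split undirected edges into train/val/test sets and sample negatives.

    `edges` is an (E, 2) array with each undirected edge listed once.
    """
    rng = np.random.default_rng(seed)
    edges = np.asarray(edges)
    num_test = int(len(edges) * test_percent / 100.)
    num_val = int(len(edges) * val_percent / 100.)

    # Shuffle once, then slice: no per-edge random draws.
    perm = rng.permutation(len(edges))
    test_edges = edges[perm[:num_test]]
    val_edges = edges[perm[num_test:num_test + num_val]]
    train_edges = edges[perm[num_test + num_val:]]

    # O(1) membership checks via a hash set: this replaces the expensive
    # row-by-row scans and is where most of the speed-up comes from.
    edge_set = {(min(i, j), max(i, j)) for i, j in edges}

    # Rejection-sample negative (non-existing) edges for val/test.
    neg, seen = [], set()
    while len(neg) < num_test + num_val:
        i, j = rng.integers(0, num_nodes, size=2)
        if i == j:
            continue
        pair = (min(i, j), max(i, j))
        if pair in edge_set or pair in seen:
            continue
        seen.add(pair)
        neg.append(pair)
    test_edges_false = np.array(neg[:num_test])
    val_edges_false = np.array(neg[num_test:])

    return train_edges, val_edges, test_edges, val_edges_false, test_edges_false
```

The final assertions in the original function (checking that the splits are disjoint) can be kept or dropped; as noted above, they dominate the runtime.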

stefanosantaris avatar Dec 31 '19 09:12 stefanosantaris

Thanks -- I'll leave this open for now in case someone is interested in a (working) example for a more efficient implementation. I rolled the master branch back to the original version before @philipjackson's PR to keep it in line with the paper.

tkipf avatar Jan 03 '20 13:01 tkipf

Dear all,

Contrary to previous comments (here + #54), I was able to reproduce all results from @tkipf's original paper using @philipjackson's implementation (see #25) of the mask_test_edges function.

I suspect that previous issues simply come from different train/validation/test splits. Indeed, @philipjackson set the default parameter values to test_percent=30. and val_percent=20., whereas @tkipf used test_percent=10. and val_percent=5. in his experiments. So, @philipjackson masks more edges from the training graph than the original experiments did, which leads to lower performance. With the corrected parameters, I reach results consistent with the paper.
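A quick back-of-the-envelope check of this point (plain arithmetic, not code from the repository): under the PR's defaults only half of the edges remain in the training graph, versus 85% under the paper's setting.

```python
# Fraction of edges left in the training graph for a given split
# (test_percent and val_percent are percentages of all edges, as in the thread).
def train_fraction(test_percent, val_percent):
    return 1.0 - (test_percent + val_percent) / 100.0

print(train_fraction(30., 20.))  # PR defaults: half the edges are masked
print(train_fraction(10., 5.))   # paper's setting: 85% of edges remain
```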

Moreover, on the PubMed dataset, i.e. the largest one, @stefanosantaris's implementation runs in over 3 minutes on my laptop. @philipjackson's implementation runs in 0.03 seconds, and in a few seconds even for a graph with 1 million nodes (I removed all assert lines from both functions).

As a consequence, I would recommend using #25 with updated default parameters. :)

GuillaumeSalhaGalvan avatar Jan 07 '20 20:01 GuillaumeSalhaGalvan


Hi @GuillaumeSalha! Can you reproduce the results in the paper with test_percent=10 and val_percent=5 using either the original or the updated implementation? I still cannot reproduce them with the original one. Sad...

haorannlp avatar Jan 09 '20 13:01 haorannlp

Hi @haorannlp! After changing test_percent=10. and val_percent=5. in preprocessing.py, you need to reinstall the package:

```
cd ..
python setup.py install
```

Then, indeed, you should be able to reproduce results from the paper.

GuillaumeSalhaGalvan avatar Jan 09 '20 13:01 GuillaumeSalhaGalvan

Hi everyone,

I think what happened here is that I wrote this code along with @sbonner0 for use in a paper of our own, in which we used different-sized val and test splits, and only submitted it as a pull request here as an afterthought. That's why my default val_percent and test_percent don't match @tkipf's originals; I didn't think to revert them when I made the pull request. Apologies for the inconvenience caused, and thanks to @GuillaumeSalha for spotting the issue!

philipjackson avatar Jan 09 '20 16:01 philipjackson

Thank you buddy, @GuillaumeSalha! I can reproduce the results now.

haorannlp avatar Jan 11 '20 02:01 haorannlp