TC-GNN_ATC23
Improper use of CUDA Graph in TC-GNN
Hello,
I wanted to bring to your attention a potential issue with the use of CUDA Graph in TC-GNN. According to the PyTorch documentation and tutorial, CUDA Graph captures the GPU kernels launched on a specific stream and then replays them. However, in TC-GNN the manually implemented kernels (e.g., TCGNN_conv/TCGNN_kernel.cu) are launched without specifying a stream. As a result, these kernels are neither captured nor replayed (executed) when the graph runs.
It seems that the speedup achieved through the utilization of CUDA Graph is a consequence of these kernels being ignored during execution. In fact, during my profiling, I observed that the performance was over five times faster than the non-CUDA Graph test that actually runs the kernels. Furthermore, the lower test accuracy experienced in #5 also supports this observation.
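One way to check this kind of behavior is to compare a graph replay against an ordinary eager call on fresh input. The sketch below is only an illustration under stated assumptions: `replay_matches_eager` and `fn` are hypothetical names, `fn` stands in for whatever layer or op is under test, and the capture follows the warm-up, capture, replay pattern from the PyTorch CUDA Graphs documentation. If the kernels behind `fn` are not captured, the replayed output should not track the new input and the function returns `False`.

```python
try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def replay_matches_eager(fn, shape=(1024,), atol=1e-5):
    """Return True if replaying a captured graph of `fn` matches an eager call.

    Returns None when PyTorch or a CUDA device is unavailable.
    `fn` is a hypothetical stand-in for the op under test.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    static_in = torch.randn(*shape, device="cuda")

    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            fn(static_in)
    torch.cuda.current_stream().wait_stream(side)

    # Capture one call of fn into a CUDA graph.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = fn(static_in)

    # Overwrite the static input with fresh data and replay the graph.
    fresh = torch.randn(*shape, device="cuda")
    static_in.copy_(fresh)
    graph.replay()
    torch.cuda.synchronize()

    # An eager call on the same data is the reference result.
    expected = fn(fresh)
    return torch.allclose(static_out, expected, atol=atol)


if __name__ == "__main__":
    # A pure PyTorch op is used here only as a placeholder for the real layer.
    print(replay_matches_eager(lambda x: x * 2 + 1))
```

Running the same check with a layer backed by a custom extension kernel in place of the placeholder lambda would show whether that kernel takes part in the replay.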
I kindly suggest considering some corrections to the tested results that involve the usage of CUDA Graph.
Thank you for your attention to this matter.
Hi, thanks for reaching out and for bringing this to our attention!
Our current observation is that CUDA Graph in PyTorch seems to have trouble supporting kernels that take dynamic arrays (e.g., `edge_list` or `row_ptr`) in GNN workloads.
Here are results we recently collected without CUDA Graph, compared against DGL on an RTX 3090.
The original conclusion that TC-GNN has a performance advantage over DGL still holds on average, although TC-GNN is slower on amazon0505 and soc-BlogCatalog.
GCN-model | DGL | TC-GNN(w/o CUDA Graph) | Speedup (x) |
---|---|---|---|
citeseer | 7.27 | 3.75 | 1.94 |
cora | 7.05 | 3.68 | 1.92 |
pubmed | 7.34 | 3.74 | 1.96 |
ppi | 7.56 | 4.46 | 1.70 |
PROTEINS_full | 7.48 | 3.75 | 2.00 |
OVCAR-8H | 69.45 | 66.95 | 1.04 |
Yeast | 63.67 | 61.07 | 1.04 |
DD | 13.35 | 10.53 | 1.27 |
YeastH | 114.87 | 111.48 | 1.03 |
amazon0505 | 20.58 | 22.70 | 0.91 |
artist | 7.45 | 4.50 | 1.66 |
com-amazon | 16.70 | 16.69 | 1.00 |
soc-BlogCatalog | 7.56 | 9.41 | 0.80 |
amazon0601 | 19.58 | 19.55 | 1.00 |
Average | | | 1.38 |

AGNN-model | DGL | TC-GNN(w/o CUDA Graph) | Speedup (x) |
---|---|---|---|
citeseer | 31.25 | 10.31 | 3.03 |
cora | 31.08 | 10.34 | 3.01 |
pubmed | 31.38 | 10.59 | 2.96 |
ppi | 40.28 | 19.89 | 2.03 |
PROTEINS_full | 31.47 | 10.48 | 3.00 |
OVCAR-8H | 143.94 | 112.05 | 1.28 |
Yeast | 131.67 | 100.85 | 1.31 |
DD | 44.31 | 23.29 | 1.90 |
YeastH | 231.63 | 184.51 | 1.26 |
amazon0505 | 69.63 | 118.42 | 0.59 |
artist | 40.40 | 38.71 | 1.04 |
com-amazon | 50.67 | 41.60 | 1.22 |
soc-BlogCatalog | 50.72 | 81.73 | 0.62 |
amazon0601 | 61.05 | 47.42 | 1.29 |
Average | | | 1.75 |
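
For anyone who wants to re-derive the last column, here is a small sketch that recomputes it from the two time columns. It assumes that Speedup is the DGL time divided by the TC-GNN time and that Average is the arithmetic mean of the per-dataset speedups; this is inferred from the numbers, not stated in the thread. Because the times in the tables are already rounded, a recomputed ratio can differ from the reported one in the last digit.

```python
from statistics import mean

# (dataset, DGL time, TC-GNN time, reported speedup), copied from the tables above.
GCN = [
    ("citeseer", 7.27, 3.75, 1.94),
    ("cora", 7.05, 3.68, 1.92),
    ("pubmed", 7.34, 3.74, 1.96),
    ("ppi", 7.56, 4.46, 1.70),
    ("PROTEINS_full", 7.48, 3.75, 2.00),
    ("OVCAR-8H", 69.45, 66.95, 1.04),
    ("Yeast", 63.67, 61.07, 1.04),
    ("DD", 13.35, 10.53, 1.27),
    ("YeastH", 114.87, 111.48, 1.03),
    ("amazon0505", 20.58, 22.70, 0.91),
    ("artist", 7.45, 4.50, 1.66),
    ("com-amazon", 16.70, 16.69, 1.00),
    ("soc-BlogCatalog", 7.56, 9.41, 0.80),
    ("amazon0601", 19.58, 19.55, 1.00),
]

AGNN = [
    ("citeseer", 31.25, 10.31, 3.03),
    ("cora", 31.08, 10.34, 3.01),
    ("pubmed", 31.38, 10.59, 2.96),
    ("ppi", 40.28, 19.89, 2.03),
    ("PROTEINS_full", 31.47, 10.48, 3.00),
    ("OVCAR-8H", 143.94, 112.05, 1.28),
    ("Yeast", 131.67, 100.85, 1.31),
    ("DD", 44.31, 23.29, 1.90),
    ("YeastH", 231.63, 184.51, 1.26),
    ("amazon0505", 69.63, 118.42, 0.59),
    ("artist", 40.40, 38.71, 1.04),
    ("com-amazon", 50.67, 41.60, 1.22),
    ("soc-BlogCatalog", 50.72, 81.73, 0.62),
    ("amazon0601", 61.05, 47.42, 1.29),
]


def recompute(rows):
    """Return per-dataset DGL/TC-GNN ratios and their arithmetic mean."""
    ratios = {name: dgl / tcgnn for name, dgl, tcgnn, _ in rows}
    return ratios, mean(ratios.values())


for label, rows in (("GCN", GCN), ("AGNN", AGNN)):
    ratios, avg = recompute(rows)
    # Datasets where TC-GNN takes longer than DGL (ratio below 1).
    slower = [name for name, ratio in ratios.items() if ratio < 1.0]
    print(f"{label}: mean speedup {avg:.2f}, TC-GNN slower on {slower}")
```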
We will soon update our current code repo to fix this error.