
Question: warp-based thread alignment

Open sdkjksfd opened this issue 3 years ago • 5 comments

Hi,

Your work on GNNAdvisor is fantastic. Many thanks for creating this repository to share the experiment source code.

I have one question about warp-based thread alignment. In Section 4.3 of the GNNAdvisor OSDI paper, you say, "warp-aligned thread mapping can merge memory requests from the same warp into one global memory transaction." How is this warp alignment achieved in CUDA: by changing the number of warps per block, or through some other thread scheduling/orchestration? Could you please give some guidance on where the relevant source code is and how it works? Thanks for your time and help.

Best wishes,

sdkjksfd avatar Jul 09 '21 13:07 sdkjksfd

Thanks for your interest in our project!

For the warp-aligned design, we basically assign each warp to process one neighbor partition (consisting of a fixed number of neighbors of a single target node). You can check these parts of the code: https://github.com/YukeWang96/OSDI21_AE/blob/3a99666895108c5ed4ddae371d12a40fd1b2a4bc/GNNAdvisor/GNNConv/GNNAdvisor_kernel.cu#L287-L289

https://github.com/YukeWang96/OSDI21_AE/blob/3a99666895108c5ed4ddae371d12a40fd1b2a4bc/GNNAdvisor/GNNConv/GNNAdvisor_kernel.cu#L350-L354

Note that

  • Instead of using the conventional CSR graph format, we build a partPointer list in place of the nodePointer list to index the neighbor range of each partition/warp.
  • Another thing: if the number of neighbors of a target node is smaller than the pre-defined partition size, we still make all of its neighbors one partition (i.e., no partition spans across different target nodes). This guarantees that each warp aggregates to only one target node, which avoids global synchronization (a sketch of this mapping follows below).
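
To make this concrete, here is a minimal CUDA sketch of the warp-per-partition mapping described above. It is not the repository's actual kernel; the identifiers (`partPointer`, `part2Node`, `column_index`, `dimWorker`) and the row-major embedding layout are assumptions modeled on the paper's description.

```cuda
// Minimal sketch: one warp aggregates one neighbor partition into its
// single target node. Not the repository's kernel; names are assumed.
__global__ void warp_per_partition_aggregate(
    float*       output,        // [num_nodes, dim] aggregated embeddings
    const float* input,         // [num_nodes, dim] input embeddings
    const int*   partPointer,   // [num_parts + 1] neighbor range per partition
    const int*   part2Node,     // [num_parts] target node of each partition
    const int*   column_index,  // flattened neighbor (source node) ids
    int num_parts, int dim, int dimWorker)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = tid / 32;   // one hardware warp per partition
    int laneId = tid % 32;
    if (warpId >= num_parts || laneId >= dimWorker) return; // extra lanes idle

    int target = part2Node[warpId];
    int beg    = partPointer[warpId];
    int end    = partPointer[warpId + 1];

    for (int n = beg; n < end; ++n) {
        int src = column_index[n];
        // Lanes 0..dimWorker-1 touch consecutive elements of one row, so
        // their loads can merge into a single global memory transaction.
        for (int d = laneId; d < dim; d += dimWorker) {
            // A large-degree node spans multiple partitions/warps, so
            // partial sums are combined with atomics.
            atomicAdd(&output[target * dim + d], input[src * dim + d]);
        }
    }
}
```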

YukeWang96 avatar Jul 10 '21 17:07 YukeWang96

Hi @YukeWang96,

About the warp and dimension worker, I have another issue.

Figure 11(c) in your paper shows that the number of dimension workers is often set to 16. With "each warp to process a neighbor partition", does this mean that 16 threads are working while the other 16 threads in the same warp sit idle? Furthermore, would it be more efficient if one warp processed two partitions in this situation?

Thanks for your time. Sincerely,

xsxhust avatar Jul 16 '21 03:07 xsxhust

That is a good question. I haven't tried the design you mention; it might help in some cases, e.g., when the partition size is small and the neighbors from two partitions are aggregated to the same target node. However, if the two partitions belong to two different target nodes, the design you mention may not be as efficient, because of the additional global memory access for the second target node's embedding.
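
For what it's worth, a hypothetical sketch of the "two partitions per warp" variant for dimWorker = 16 might look like the following (again, not code from the repository; all identifiers are assumptions). The comment marks where the coalescing concern shows up.

```cuda
// Hypothetical variant: two 16-thread partitions packed into one warp.
__global__ void aggregate_two_parts_per_warp(
    float* output, const float* input,
    const int* partPointer, const int* part2Node, const int* column_index,
    int num_parts, int dim)
{
    const int DIM_WORKER = 16;
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId  = tid / 32;
    int half    = (tid % 32) / DIM_WORKER;  // which half-warp: 0 or 1
    int subLane = tid % DIM_WORKER;         // lane within the half-warp
    int partId  = warpId * 2 + half;        // two partitions per warp
    if (partId >= num_parts) return;

    int target = part2Node[partId];
    for (int n = partPointer[partId]; n < partPointer[partId + 1]; ++n) {
        int src = column_index[n];
        for (int d = subLane; d < dim; d += DIM_WORKER) {
            // If the two half-warps have different target/source nodes,
            // the warp's 32 loads span two embedding rows and can no
            // longer merge into a single global transaction.
            atomicAdd(&output[target * dim + d], input[src * dim + d]);
        }
    }
}
```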

YukeWang96 avatar Jul 16 '21 03:07 YukeWang96

Sorry for a mistake in the last post: Figure 11(c) is from your arXiv paper; it should be Figure 12(b) in the OSDI paper.

For the idea of letting one warp process two partitions, we did a small test.

We changed your code https://github.com/YukeWang96/OSDI21_AE/blob/3a99666895108c5ed4ddae371d12a40fd1b2a4bc/GNNAdvisor/GNNConv/GNNAdvisor_kernel.cu#L8 to `#define WARP_SIZE dimWorker`.

We printed dimWorker to verify that it is either 32 or 16, which ensures that the threads processing the same partition are not split across different warps. But we don't know whether the change causes any other side effects.
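
In case it helps to pin down what the macro change does to the mapping, here is a small demo of the index arithmetic in isolation (the exact expressions in GNNAdvisor_kernel.cu may differ; `mapping_demo` and `partOf` are just illustration names).

```cuda
// With WARP_SIZE == 32 and dimWorker == 16, lanes 16..31 of every warp
// fail the laneId < dimWorker check and stay idle. With
// WARP_SIZE == dimWorker, the partition id advances every dimWorker
// threads, so two 16-thread logical warps share one hardware warp.
__global__ void mapping_demo(int* partOf, int warpSize_ /* 32 or dimWorker */)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = tid / warpSize_;  // logical warp = partition id
    int laneId = tid % warpSize_;  // compared against dimWorker in the kernel
    partOf[tid] = warpId;          // record which partition each thread serves
    (void)laneId;
}
```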

We ran GCN on all five "Type Ⅲ" graphs and got the following results:

| dataset | warp=32 (sec) | warp=dimWorker (sec) | speedup |
| --- | --- | --- | --- |
| amazon0505 | 0.642692 | 0.519317 | 1.237572 |
| artist | 0.143156 | 0.121706 | 1.176244 |
| com-amazon | 0.323492 | 0.27105 | 1.193477 |
| soc-BlogCatalog | 0.265914 | 0.203573 | 1.306234 |
| amazon0601 | 0.659007 | 0.514047 | 1.281998 |

The "forward aggregation time" is the "spmm_forward_cuda_kernel" kernel time with 100 epoch, reported by nvprof.

Can this suggest that the benefit from coalesced memory access does not offset the problem of idle threads (the waste of computing resources)?

xsxhust avatar Jul 16 '21 09:07 xsxhust

From your test result, I think it is as you suggest. I believe testing more cases, such as

1) more than two partitions per warp;
2) more neighbors per partition;

would also be helpful to get more insights into such tradeoffs.

YukeWang96 avatar Jul 16 '21 23:07 YukeWang96