Make full use of NIC

Open clearsky07 opened this issue 2 years ago • 15 comments

I found a bug in the NIC selection in RCCL, which I have described in detail in the attached document. I've made a few improvements so that the NICs are used more fully when using "NCCL_MIN_NCHANNELS". make full use of NIC.docx

clearsky07 avatar Aug 07 '23 09:08 clearsky07

I don't know why so many tests failed. The added code runs in my RCCL library and achieves good performance. You can read the DOC document I wrote for further analysis.

clearsky07 avatar Aug 09 '23 01:08 clearsky07

Thanks - we'll need some more time for analysis and testing.

gilbertlee-amd avatar Aug 14 '23 22:08 gilbertlee-amd

The PR is consistently failing our CI tests - we're still investigating.

gilbertlee-amd avatar Aug 22 '23 19:08 gilbertlee-amd

@clearsky07 - In the attached doc, you indicated that you observed only two NICs (out of four) are getting used. How many ranks did you run per node? What is the number of GPUs per node?

nusislam avatar Aug 22 '23 20:08 nusislam

@clearsky07 - In the attached doc, you indicated that you observed only two NICs (out of four) are getting used. How many ranks did you run per node? What is the number of GPUs per node?

I'm sorry I didn't state my test clearly. I ran 4 ranks per node, and each node has 4 GPUs. I actually ran at several different scales on my machine. Here are my test results: each node runs 4 GPUs (4 ranks) and has 4 NICs, and the nodes are connected by an InfiniBand network (25GB/s). I used all_gather_perf from rccl-tests as my benchmark and report bus bandwidth as the result. The message sizes scanned range from 1MB to 1GB.

PS: My changes don't work for sendrecv - maybe it is because sendrecv doesn't work on RING? Thanks for helping me analyze these. test.xlsx
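For readers comparing the numbers: a minimal sketch of how the bus bandwidth of an all-gather relates to the measured time, assuming the convention used by nccl-tests (which rccl-tests follows, as far as I know), where the reported size is the total gathered size and bus bandwidth is the algorithm bandwidth scaled by (n-1)/n. The function name and the sample values are made up for illustration and are not taken from test.xlsx.

```python
from typing import Tuple


def allgather_bandwidth(total_bytes: int, seconds: float, n_ranks: int) -> Tuple[float, float]:
    """Return (algbw, busbw) in GB/s for one all-gather measurement.

    total_bytes is the full gathered size (per-rank bytes * n_ranks).
    Assumed convention: busbw = algbw * (n_ranks - 1) / n_ranks.
    """
    if seconds <= 0 or n_ranks < 1:
        raise ValueError("seconds must be positive and n_ranks at least 1")
    algbw = total_bytes / 1e9 / seconds
    busbw = algbw * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Hypothetical measurement: 1 GB gathered across 8 ranks in 50 ms.
algbw, busbw = allgather_bandwidth(1_000_000_000, 0.05, 8)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

Because of the (n-1)/n factor, bus bandwidth can be compared against the per-NIC link speed regardless of how many ranks took part.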

clearsky07 avatar Aug 23 '23 01:08 clearsky07

The PR is consistently failing our CI tests - we're still investigating.

Thanks a lot. I can only run on my specific machine, where I got some performance improvement on collective communication when running rccl-tests. So I don't know whether my change still works on other machines, and I hope you can help me test it.

clearsky07 avatar Aug 23 '23 01:08 clearsky07

I could not reproduce the issue that you indicated in RCCL. I ran an RCCL allgather test on 2 nodes. On each node I ran 4 ranks (4 GPUs and 4 NICs), and I see each of the ranks on a node using a different NIC.

nusislam avatar Aug 23 '23 14:08 nusislam

Can you explain what output information or tools you used to determine that RCCL used 4 NICs? My "NCCL_DEBUG=INFO" output showed "NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_3:1/IB", but lines like "NCCL INFO Channel 01/0 : 3[63000] -> 4[4000] [receive] via NET/IB/0/GDRDMA comm 0x2ae490000ab0 nRanks 08" showed that NIC 1 and NIC 2 are not used. For collective communication, is it correct to say that whichever NIC a channel is assigned when it is generated is the only NIC that channel will use afterwards?

clearsky07 avatar Aug 23 '23 14:08 clearsky07

The PR is consistently failing our CI tests - we're still investigating.

Hi, is there any progress?

clearsky07 avatar Sep 04 '23 10:09 clearsky07

@clearsky07 - We will have to look into the test failures. In the meantime, can you explain at a high level what you are trying to achieve in this PR compared to the existing design?

nusislam avatar Sep 07 '23 21:09 nusislam

@clearsky07 - We will have to look into the test failures. In the meantime, can you explain at a high level what you are trying to achieve in this PR compared to the existing design?

Sure, I'll tell you the whole process of my research from beginning to end. In the beginning, when I ran rccl-tests, I found that NCCL_DEBUG=INFO printed messages like:

b05r4n19:18361:18429 [3] NCCL INFO Channel 00/0 : 7[63000] -> 0[4000] [send] via NET/IB/1 comm 0x2209a60 nRanks 08
b05r4n19:18361:18429 [3] NCCL INFO Channel 01/0 : 7[63000] -> 0[4000] [send] via NET/IB/3 comm 0x2209a60 nRanks 08
b05r4n19:18361:18429 [3] NCCL INFO Channel 02/0 : 7[63000] -> 0[4000] [send] via NET/IB/1 comm 0x2209a60 nRanks 08
b05r4n19:18361:18429 [3] NCCL INFO Channel 03/0 : 7[63000] -> 0[4000] [send] via NET/IB/3 comm 0x2209a60 nRanks 08
b05r4n19:18361:18429 [3] NCCL INFO Channel 04/0 : 7[63000] -> 0[4000] [send] via NET/IB/1 comm 0x2209a60 nRanks 08
b05r4n19:18361:18429 [3] NCCL INFO Channel 05/0 : 7[63000] -> 0[4000] [send] via NET/IB/3 comm 0x2209a60 nRanks 08

There is no use of IB 1 and IB 2 when the channels are generated. So I started to wonder: is it possible that whichever NIC a channel is assigned when it is generated is also the NIC it uses during data transmission? I asked @sjeaugey what graph->inter[] means, and he answered that graph->inter is the list of NICs (NIC to enter the node, NIC to exit the node). I printed graph->inter[] and got 0,3,0,3, which means IB 1 and IB 2 really were not used; all channels will only use IB 0 and IB 3. So I added code that rewrites graph->inter[] to make sure RCCL spreads the channels evenly across all detected NICs. Can you add the code to your own RCCL and build from source to see whether it is useful?
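To make the idea concrete, here is a minimal sketch in Python of what an even spread could look like. It is not the code from this PR: the function name balanced_inter is hypothetical, the flat layout (two entries per channel: enter NIC, then exit NIC) is assumed from the description above, and using the same NIC for both entries is a simplification. A real change inside RCCL would also have to respect GPU-to-NIC topology, which this sketch ignores.

```python
from collections import Counter
from typing import List


def balanced_inter(n_channels: int, nics: List[int]) -> List[int]:
    """Build a flat [enter, exit, enter, exit, ...] list, cycling over nics.

    Hypothetical illustration only: channel c gets nics[c % len(nics)]
    for both its enter and exit entry.
    """
    if not nics:
        raise ValueError("at least one NIC is required")
    inter: List[int] = []
    for channel in range(n_channels):
        nic = nics[channel % len(nics)]
        inter.extend([nic, nic])
    return inter


# Value reported in this thread: only NIC 0 and NIC 3 appear.
observed = [0, 3, 0, 3]
print("observed usage:", dict(Counter(observed)))

# Round-robin over four detected NICs touches every NIC once per cycle.
print("balanced:", balanced_inter(4, [0, 1, 2, 3]))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With fewer channels than NICs some NICs still stay idle, which is presumably why the change matters most together with NCCL_MIN_NCHANNELS.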

clearsky07 avatar Sep 08 '23 02:09 clearsky07

The output that you have here is for a single rank. I understand that Rank-3 is using only IB-1 and IB-3. What about the other ranks on that node? Do they use other NICs?

Also, from the output that you shared here, Rank-3 is using IB-1 and IB-3. Why is graph->inter[]=0,3,0,3 (not 1,3,1,3)?

nusislam avatar Sep 08 '23 14:09 nusislam

@clearsky07 - We will have to look into the test failures. In the meantime, can you explain at a high level what you are trying to achieve in this PR compared to the existing design?

Regarding the test failure, the Allgather RCCL unit test is failing.
RCCL version 2.18.3+hip5.7 HEAD:eb260db
[ ERROR ] Child 0 pipe closed unexpectedly
script returned exit code 1

nusislam avatar Sep 08 '23 20:09 nusislam

@nusislam I am sorry, there are two versions of RCCL on my machine: rccl-develop and RCCL with this PR. I may have used output from the wrong version (RCCL with this PR) to describe the problem. Anyway, the two markdown files below contain my test case and output for rccl-develop and for RCCL with this PR, on 2 nodes with 8 GPUs in total, running rccl-tests-master all_gather_perf. RCCL-develop test result.md Rccl-with this PR test result.md

You can analyze the two files by searching for "net/IB/" and counting how many times each NIC is used; you can also see the increase in performance. Finally, the purpose of this PR is not to get it merged into RCCL as it is; I am just describing the problem and proposing a simple solution, which you can try to improve. By the way, I have only verified on my own multi-NIC host that this code improves performance. I hope you can help me verify it on other hosts and let me know the results.
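The counting described above can be sketched as a small Python script. This is only a rough helper, not part of the PR: it assumes the NCCL_DEBUG=INFO channel lines have the same shape as the ones quoted earlier in this thread (host:pid:tid, a [device] index, then "via NET/IB/<n>"), and the names LINE_RE and nic_usage are made up here. Lines in any other format are silently skipped.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as:
# b05r4n19:18361:18429 [3] NCCL INFO Channel 00/0 : 7[63000] -> 0[4000] [send] via NET/IB/1 comm ...
LINE_RE = re.compile(
    r"(?P<host>\S+):\d+:\d+ \[(?P<dev>\d+)\] NCCL INFO Channel \d+/\d+ : "
    r"\d+\[\w+\] -> \d+\[\w+\] \[(?:send|receive)\] via NET/IB/(?P<nic>\d+)"
)


def nic_usage(lines):
    """Count channel lines per NIC index, grouped by (host, local device)."""
    usage = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (match["host"], int(match["dev"]))
            usage[key][int(match["nic"])] += 1
    return usage


# Two of the lines quoted earlier in this thread.
sample = [
    "b05r4n19:18361:18429 [3] NCCL INFO Channel 00/0 : 7[63000] -> 0[4000] [send] via NET/IB/1 comm 0x2209a60 nRanks 08",
    "b05r4n19:18361:18429 [3] NCCL INFO Channel 01/0 : 7[63000] -> 0[4000] [send] via NET/IB/3 comm 0x2209a60 nRanks 08",
]
for (host, dev), counts in sorted(nic_usage(sample).items()):
    print(host, "device", dev, dict(sorted(counts.items())))
```

To run it on a saved log, pass an open file instead of the sample list, for example nic_usage(open(path)), where path is wherever the output was saved. Grouping by host and device gives the per-rank view, so it is easy to see whether other ranks on the same node use the remaining NICs.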

clearsky07 avatar Sep 09 '23 08:09 clearsky07

@clearsky07 - Have you run the RCCL unit tests (https://github.com/ROCmSoftwarePlatform/rccl/tree/develop/test) with your PR? Allgather unit test is failing in our CI. If your PR is not applicable for all tests then you need to enable your changes through an environment variable.

Also please share NIC usage for at least two ranks on a node using the default design and your PR.

nusislam avatar Sep 11 '23 14:09 nusislam

Closing for now.

gilbertlee-amd avatar May 15 '24 15:05 gilbertlee-amd