
[Question]: Intra-node communication method in multi-node, multi-GPU training

Open · DarrenYing opened this issue on Oct 11, 2023 · 0 comments

Description

I have 4 nodes with 4 GPUs each, and I ran a pair of comparison experiments.

Configuration 1: data parallelism (DP) within each node, pipeline parallelism (PP) across nodes:

P_STAGE1 = flow.placement("cuda", ranks=[0, 1, 2, 3])
P_STAGE2 = flow.placement("cuda", ranks=[4, 5, 6, 7])
P_STAGE3 = flow.placement("cuda", ranks=[8, 9, 10, 11])
P_STAGE4 = flow.placement("cuda", ranks=[12, 13, 14, 15])
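
For reference, here is a minimal sketch of how a tensor could be placed on one of these stages (the tensor shape and SBP choice are illustrative assumptions, not taken from my actual model; it would be run with 16 processes via python3 -m oneflow.distributed.launch):

import oneflow as flow

# Stage 1 spans the 4 GPUs of the first node; placing a tensor on it with a
# split SBP makes those 4 ranks form one data-parallel NCCL communicator.
P_STAGE1 = flow.placement("cuda", ranks=[0, 1, 2, 3])
x = flow.randn(16, 8)  # illustrative shape
x_global = x.to_global(placement=P_STAGE1, sbp=flow.sbp.split(0))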

An excerpt of the NCCL logs:

50e30e9f1c94:30195:30454 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
50e30e9f1c94:30193:30458 [0] NCCL INFO Channel 00/02 :    0   1   2   3
50e30e9f1c94:30193:30458 [0] NCCL INFO Channel 01/02 :    0   1   2   3
50e30e9f1c94:30193:30458 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
50e30e9f1c94:30196:30456 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
50e30e9f1c94:30193:30458 [0] NCCL INFO Channel 00 : 0[31000] -> 1[4b000] via direct shared memory
50e30e9f1c94:30194:30455 [1] NCCL INFO Channel 00 : 1[4b000] -> 2[b1000] via direct shared memory
50e30e9f1c94:30193:30458 [0] NCCL INFO Channel 01 : 0[31000] -> 1[4b000] via direct shared memory
50e30e9f1c94:30194:30455 [1] NCCL INFO Channel 01 : 1[4b000] -> 2[b1000] via direct shared memory
50e30e9f1c94:30195:30454 [2] NCCL INFO Channel 00 : 2[b1000] -> 3[ca000] via direct shared memory
50e30e9f1c94:30196:30456 [3] NCCL INFO Channel 00 : 3[ca000] -> 0[31000] via direct shared memory
50e30e9f1c94:30195:30454 [2] NCCL INFO Channel 01 : 2[b1000] -> 3[ca000] via direct shared memory
50e30e9f1c94:30196:30456 [3] NCCL INFO Channel 01 : 3[ca000] -> 0[31000] via direct shared memory
50e30e9f1c94:30193:30458 [0] NCCL INFO Connected all rings
50e30e9f1c94:30194:30455 [1] NCCL INFO Connected all rings
50e30e9f1c94:30195:30454 [2] NCCL INFO Connected all rings
50e30e9f1c94:30196:30456 [3] NCCL INFO Connected all rings
50e30e9f1c94:30196:30456 [3] NCCL INFO Channel 00 : 3[ca000] -> 2[b1000] via direct shared memory
50e30e9f1c94:30196:30456 [3] NCCL INFO Channel 01 : 3[ca000] -> 2[b1000] via direct shared memory
50e30e9f1c94:30194:30455 [1] NCCL INFO Channel 00 : 1[4b000] -> 0[31000] via direct shared memory
50e30e9f1c94:30194:30455 [1] NCCL INFO Channel 01 : 1[4b000] -> 0[31000] via direct shared memory
50e30e9f1c94:30193:30458 [0] NCCL INFO Connected all trees
50e30e9f1c94:30193:30458 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
50e30e9f1c94:30193:30458 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
50e30e9f1c94:30195:30454 [2] NCCL INFO Channel 00 : 2[b1000] -> 1[4b000] via direct shared memory
50e30e9f1c94:30195:30454 [2] NCCL INFO Channel 01 : 2[b1000] -> 1[4b000] via direct shared memory
50e30e9f1c94:30195:30454 [2] NCCL INFO Connected all trees
50e30e9f1c94:30195:30454 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
50e30e9f1c94:30195:30454 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
50e30e9f1c94:30194:30455 [1] NCCL INFO Connected all trees
50e30e9f1c94:30194:30455 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
50e30e9f1c94:30194:30455 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
50e30e9f1c94:30196:30456 [3] NCCL INFO Connected all trees
50e30e9f1c94:30196:30456 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
50e30e9f1c94:30196:30456 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
50e30e9f1c94:30195:30454 [2] NCCL INFO comm 0x7f16befc8780 rank 2 nranks 4 cudaDev 2 busId b1000 - Init COMPLETE
50e30e9f1c94:30193:30458 [0] NCCL INFO comm 0x7fa84f0c8a60 rank 0 nranks 4 cudaDev 0 busId 31000 - Init COMPLETE
50e30e9f1c94:30194:30455 [1] NCCL INFO comm 0x7f05f6fc8340 rank 1 nranks 4 cudaDev 1 busId 4b000 - Init COMPLETE
50e30e9f1c94:30193:30458 [0] NCCL INFO Launch mode Parallel
50e30e9f1c94:30196:30456 [3] NCCL INFO comm 0x7fe546fd0320 rank 3 nranks 4 cudaDev 3 busId ca000 - Init COMPLETE
50e30e9f1c94:30195:30646 [2] NCCL INFO Setting affinity for GPU 2 to fff0,00fff000
50e30e9f1c94:30194:30664 [1] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
50e30e9f1c94:30196:30647 [3] NCCL INFO Setting affinity for GPU 3 to fff0,00fff000
50e30e9f1c94:30193:30645 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
50e30e9f1c94:30193:30645 [0] NCCL INFO Channel 00/02 :    0   1   2   3
50e30e9f1c94:30196:30647 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
50e30e9f1c94:30193:30645 [0] NCCL INFO Channel 01/02 :    0   1   2   3
50e30e9f1c94:30193:30645 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
50e30e9f1c94:30194:30664 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
50e30e9f1c94:30195:30646 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
50e30e9f1c94:30196:30647 [3] NCCL INFO Channel 00 : 3[ca000] -> 0[31000] via direct shared memory
50e30e9f1c94:30196:30647 [3] NCCL INFO Channel 01 : 3[ca000] -> 0[31000] via direct shared memory
50e30e9f1c94:30194:30664 [1] NCCL INFO Channel 00 : 1[4b000] -> 2[b1000] via direct shared memory
50e30e9f1c94:30194:30664 [1] NCCL INFO Channel 01 : 1[4b000] -> 2[b1000] via direct shared memory
50e30e9f1c94:30193:30645 [0] NCCL INFO Channel 00 : 0[31000] -> 1[4b000] via direct shared memory
50e30e9f1c94:30195:30646 [2] NCCL INFO Channel 00 : 2[b1000] -> 3[ca000] via direct shared memory
50e30e9f1c94:30193:30645 [0] NCCL INFO Channel 01 : 0[31000] -> 1[4b000] via direct shared memory
50e30e9f1c94:30195:30646 [2] NCCL INFO Channel 01 : 2[b1000] -> 3[ca000] via direct shared memory
50e30e9f1c94:30194:30664 [1] NCCL INFO Connected all rings
50e30e9f1c94:30196:30647 [3] NCCL INFO Connected all rings
50e30e9f1c94:30195:30646 [2] NCCL INFO Connected all rings
50e30e9f1c94:30193:30645 [0] NCCL INFO Connected all rings

Configuration 2: pipeline parallelism (PP) within each node, data parallelism (DP) across nodes:

P_STAGE1 = flow.placement("cuda", ranks=[0, 4, 8, 12])
P_STAGE2 = flow.placement("cuda", ranks=[1, 5, 9, 13])
P_STAGE3 = flow.placement("cuda", ranks=[2, 6, 10, 14])
P_STAGE4 = flow.placement("cuda", ranks=[3, 7, 11, 15])
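
For clarity, this is how the ranks map to nodes, assuming the launcher assigns 4 consecutive ranks per node (the usual default; a small sketch, not from my original script):

# Assumption: ranks are assigned node by node, 4 consecutive ranks per node.
def node_of(rank, gpus_per_node=4):
    return rank // gpus_per_node

# Under this assumption, P_STAGE1 = [0, 4, 8, 12] has one GPU on each node:
print([node_of(r) for r in [0, 4, 8, 12]])  # -> [0, 1, 2, 3]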

An excerpt of the NCCL logs:

50e30e9f1c94:32432:32694 [1] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->1
50e30e9f1c94:32432:32694 [1] NCCL INFO Channel 00/0 : 3[4b000] -> 0[4b000] [receive] via NET/Socket/0
50e30e9f1c94:32432:32694 [1] NCCL INFO Channel 01/0 : 3[4b000] -> 0[4b000] [receive] via NET/Socket/0
50e30e9f1c94:32432:32694 [1] NCCL INFO Channel 00/0 : 0[4b000] -> 1[4b000] [send] via NET/Socket/0
50e30e9f1c94:32432:32694 [1] NCCL INFO Channel 01/0 : 0[4b000] -> 1[4b000] [send] via NET/Socket/0
50e30e9f1c94:32433:32692 [2] NCCL INFO Setting affinity for GPU 2 to fff0,00fff000
50e30e9f1c94:32432:32694 [1] NCCL INFO Connected all rings
50e30e9f1c94:32433:32692 [2] NCCL INFO Channel 00/02 :    0   1   2   3
50e30e9f1c94:32433:32692 [2] NCCL INFO Channel 01/02 :    0   1   2   3
50e30e9f1c94:32433:32692 [2] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->1
50e30e9f1c94:32434:32693 [3] NCCL INFO Setting affinity for GPU 3 to fff0,00fff000
50e30e9f1c94:32434:32693 [3] NCCL INFO Channel 00/02 :    0   1   2   3
50e30e9f1c94:32434:32693 [3] NCCL INFO Channel 01/02 :    0   1   2   3
50e30e9f1c94:32434:32693 [3] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->1
50e30e9f1c94:32431:32696 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
50e30e9f1c94:32431:32696 [0] NCCL INFO Channel 00/02 :    0   1   2   3
50e30e9f1c94:32431:32696 [0] NCCL INFO Channel 01/02 :    0   1   2   3
50e30e9f1c94:32431:32696 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] -1/-1/-1->0->1
50e30e9f1c94:32433:32692 [2] NCCL INFO Channel 00/0 : 3[b1000] -> 0[b1000] [receive] via NET/Socket/0
50e30e9f1c94:32433:32692 [2] NCCL INFO Channel 01/0 : 3[b1000] -> 0[b1000] [receive] via NET/Socket/0
50e30e9f1c94:32433:32692 [2] NCCL INFO Channel 00/0 : 0[b1000] -> 1[b1000] [send] via NET/Socket/0
50e30e9f1c94:32433:32692 [2] NCCL INFO Channel 01/0 : 0[b1000] -> 1[b1000] [send] via NET/Socket/0
50e30e9f1c94:32434:32693 [3] NCCL INFO Channel 00/0 : 3[ca000] -> 0[ca000] [receive] via NET/Socket/0
50e30e9f1c94:32434:32693 [3] NCCL INFO Channel 01/0 : 3[ca000] -> 0[ca000] [receive] via NET/Socket/0
50e30e9f1c94:32434:32693 [3] NCCL INFO Channel 00/0 : 0[ca000] -> 1[ca000] [send] via NET/Socket/0
50e30e9f1c94:32434:32693 [3] NCCL INFO Channel 01/0 : 0[ca000] -> 1[ca000] [send] via NET/Socket/0
50e30e9f1c94:32431:32696 [0] NCCL INFO Channel 00/0 : 3[31000] -> 0[31000] [receive] via NET/Socket/0
50e30e9f1c94:32431:32696 [0] NCCL INFO Channel 01/0 : 3[31000] -> 0[31000] [receive] via NET/Socket/0
50e30e9f1c94:32431:32696 [0] NCCL INFO Channel 00/0 : 0[31000] -> 1[31000] [send] via NET/Socket/0
50e30e9f1c94:32431:32696 [0] NCCL INFO Channel 01/0 : 0[31000] -> 1[31000] [send] via NET/Socket/0
50e30e9f1c94:32432:32694 [1] NCCL INFO Channel 00/0 : 2[4b000] -> 0[4b000] [receive] via NET/Socket/0
50e30e9f1c94:32432:32694 [1] NCCL INFO Channel 00/0 : 0[4b000] -> 2[4b000] [send] via NET/Socket/0
50e30e9f1c94:32433:32692 [2] NCCL INFO Connected all rings
50e30e9f1c94:32434:32693 [3] NCCL INFO Connected all rings
50e30e9f1c94:32431:32696 [0] NCCL INFO Connected all rings

As the logs show, in Configuration 1 communication between GPUs on the same node goes through shared memory, while in Configuration 2 it goes through sockets. Why is that? Wouldn't it be more reasonable for GPUs on the same machine to communicate via shared memory (SHM)?
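
For debugging, I understand NCCL's transport selection can be inspected or constrained with its standard environment variables; the values below are just a suggested test setup (set before oneflow initializes NCCL), not something I have already tried:

import os

# Make NCCL log how it picks transports during communicator init.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"
# Uncomment to forbid the shared-memory transport and see what NCCL falls back to:
# os.environ["NCCL_SHM_DISABLE"] = "1"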

