DeepSpeed
BLOOM inference bug [BUG]
Describe the bug
When I run DeepSpeed inference for BLOOM, I hit the following error: Caught signal 7 (Bus error: nonexistent physical address).
To Reproduce
This is the output I get when I run:
NCCL_DEBUG=INFO deepspeed --num_gpus 8 main_deepspeed.py --model_name microsoft/bloom-deepspeed-inference-int8 --dtype int8
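With NCCL_DEBUG=INFO this command prints one line per channel connection, so the log gets long quickly. The following is a minimal sketch, not part of DeepSpeed or NCCL, that condenses such a log by counting channel connections per source rank, destination rank, and transport. The function name summarize_nccl_channels and the regular expression are assumptions based only on the line format visible in this report.

```python
import re
from collections import Counter

# Assumed line format, taken from the log in this report:
#   host:pid:tid [4] NCCL INFO Channel 00/0 : 4[87000] -> 5[90000] via P2P/IPC/read
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel \d+/\d+ : "
    r"(?P<src>\d+)\[[0-9a-f]+\] -> (?P<dst>\d+)\[[0-9a-f]+\] "
    r"via (?P<transport>\S+)"
)


def summarize_nccl_channels(lines):
    """Count channel connections per (src rank, dst rank, transport).

    Lines that do not describe a point-to-point channel connection
    (ring listings, tree listings, affinity messages) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = CHANNEL_RE.search(line)
        if match:
            key = (int(match["src"]), int(match["dst"]), match["transport"])
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: feed in a few lines copied from the console output.
    sample = [
        "c664db21f8dd:90418:95360 [4] NCCL INFO Channel 00/0 : 4[87000] -> 5[90000] via P2P/IPC/read",
        "c664db21f8dd:90418:95360 [4] NCCL INFO Channel 01/0 : 4[87000] -> 5[90000] via P2P/IPC/read",
        "c664db21f8dd:90420:94846 [5] NCCL INFO Connected all rings",
    ]
    for (src, dst, transport), n in sorted(summarize_nccl_channels(sample).items()):
        print(f"{src} -> {dst} via {transport}: {n} channel(s)")
```

A summary like this makes it easier to see whether every rank pair was connected on the expected number of channels before the failure, without reading each line.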
console output
(Only a portion of the output is shown because of GitHub's character limit.)
c664db21f8dd:90418:95360 [4] NCCL INFO Using network Socket
c664db21f8dd:90418:95360 [4] NCCL INFO Setting affinity for GPU 4 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000,00000000
c664db21f8dd:90422:94332 [6] NCCL INFO Setting affinity for GPU 6 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
c664db21f8dd:90424:95103 [7] NCCL INFO Setting affinity for GPU 7 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
c664db21f8dd:90414:93817 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
c664db21f8dd:90416:93818 [2] NCCL INFO Setting affinity for GPU 2 to ffff0000,00000000,00000000,00000000,ffff0000
c664db21f8dd:90415:94075 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
c664db21f8dd:90420:94846 [5] NCCL INFO Setting affinity for GPU 5 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000,00000000
c664db21f8dd:90417:94589 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,00000000,00000000,00000000,ffff0000
1->4->3 [21] 5/-1/-1->4->3 [22] 5/-1/-1->4->3 [23] 5/-1/-1->4->3
c664db21f8dd:90422:94332 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5 [22] 7/-1/-1->6->5 [23] 7/-1/-1->6->5
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 00/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90417:94589 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2 [22] 4/-1/-1->3->2 [23] 4/-1/-1->3->2
c664db21f8dd:90424:95103 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6 [22] -1/-1/-1->7->6 [23] -1/-1/-1->7->6
c664db21f8dd:90416:93818 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1 [22] 3/-1/-1->2->1 [23] 3/-1/-1->2->1
c664db21f8dd:90415:94075 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0 [22] 2/-1/-1->1->0 [23] 2/-1/-1->1->0
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 01/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 02/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 03/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 04/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 05/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 06/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 07/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 08/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 09/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 10/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 11/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 12/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 13/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 14/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 15/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 16/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 17/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 18/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 19/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 20/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 21/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 22/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 23/24 : 0 1 2 3 4 5 6 7
c664db21f8dd:90414:93817 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 00/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 00/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 00/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 00/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 01/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 01/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 01/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 01/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 02/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 00/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 02/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 02/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 02/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 03/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 01/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 03/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 03/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 03/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 04/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 04/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 02/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 04/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 04/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 04/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 04/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 05/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 05/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 03/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 05/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 05/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 05/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 05/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 06/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 04/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 06/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 04/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 06/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 06/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 06/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 06/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 07/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 05/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 07/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 05/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 07/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 07/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 07/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 07/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 08/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 06/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 08/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 06/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 08/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 08/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 08/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 08/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 09/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 07/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 09/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 07/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 09/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 09/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 09/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 09/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 10/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 10/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 08/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 08/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 10/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 10/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 10/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 10/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 11/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 11/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 09/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 09/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 11/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 11/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 11/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 11/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 12/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 12/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 10/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 10/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 12/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 12/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 12/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 12/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 13/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 13/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 11/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 11/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 13/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 13/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 13/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 13/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 14/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 14/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 12/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 12/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 14/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 14/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 14/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 14/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 15/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 13/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 15/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 13/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 15/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 15/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 15/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 15/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 16/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 14/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 16/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 14/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 16/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 16/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 16/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 16/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 17/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 17/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 15/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 15/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 17/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 17/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 17/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 17/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 18/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 18/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 16/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 16/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 18/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 18/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 18/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 18/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 19/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 19/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 17/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 17/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 19/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 19/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 19/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 19/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 20/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 20/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 18/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 18/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 20/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 20/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 20/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 20/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 21/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 21/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 19/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 19/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 21/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 21/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 21/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 21/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 22/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 22/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 20/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 20/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 22/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 22/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 22/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 22/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 23/0 : 4[87000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 21/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 23/0 : 6[b7000] -> 7[bd000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 21/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 23/0 : 5[90000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 23/0 : 1[f000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 23/0 : 7[bd000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 23/0 : 2[47000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 22/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 22/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 23/0 : 3[4e000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90414:93817 [0] NCCL INFO Channel 23/0 : 0[7000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Connected all rings
c664db21f8dd:90422:94332 [6] NCCL INFO Connected all rings
c664db21f8dd:90424:95103 [7] NCCL INFO Connected all rings
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 00/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Connected all rings
c664db21f8dd:90418:95360 [4] NCCL INFO Connected all rings
c664db21f8dd:90414:93817 [0] NCCL INFO Connected all rings
c664db21f8dd:90417:94589 [3] NCCL INFO Connected all rings
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 01/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 02/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Connected all rings
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 03/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 04/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 05/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 06/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 07/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 08/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 09/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 10/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 11/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 12/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 13/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 14/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 15/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 16/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 17/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 18/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 19/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 20/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 21/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 22/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 00/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 00/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Channel 23/0 : 7[bd000] -> 6[b7000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 01/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 01/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 00/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 00/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 00/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 02/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 02/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 01/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 01/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 01/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 03/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 03/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 02/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 02/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 02/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 04/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 00/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 04/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 03/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 03/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 03/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 05/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 01/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 05/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 04/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 04/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 04/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 06/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 02/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 06/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 05/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 05/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 05/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 07/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 03/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 07/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 06/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 06/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 06/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 08/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 04/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 08/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 07/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 07/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 07/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 09/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 05/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 09/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 08/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 08/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 08/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 10/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 06/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 10/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 09/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 09/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 09/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 11/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 07/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 11/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 10/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 10/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 10/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 12/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 08/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 12/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 11/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 11/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 11/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 13/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 09/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 13/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 12/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 12/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 12/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 14/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 10/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 14/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 13/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 13/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 13/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 15/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 11/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 15/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 14/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 14/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 14/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 16/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 12/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 16/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 15/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 15/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 15/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 17/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 13/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 17/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 16/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 16/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 16/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 18/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 14/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 18/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 17/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 17/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 17/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 19/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 15/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 19/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 18/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 18/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 18/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 20/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 16/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 20/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 19/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 19/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 19/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 21/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 17/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 21/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 20/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 20/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 20/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 22/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 18/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 22/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 21/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 21/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 21/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90420:94846 [5] NCCL INFO Channel 23/0 : 5[90000] -> 4[87000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 19/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90422:94332 [6] NCCL INFO Channel 23/0 : 6[b7000] -> 5[90000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 22/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 22/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 22/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 20/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90418:95360 [4] NCCL INFO Channel 23/0 : 4[87000] -> 3[4e000] via P2P/IPC/read
c664db21f8dd:90416:93818 [2] NCCL INFO Channel 23/0 : 2[47000] -> 1[f000] via P2P/IPC/read
c664db21f8dd:90417:94589 [3] NCCL INFO Channel 23/0 : 3[4e000] -> 2[47000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 21/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 22/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90415:94075 [1] NCCL INFO Channel 23/0 : 1[f000] -> 0[7000] via P2P/IPC/read
c664db21f8dd:90424:95103 [7] NCCL INFO Connected all trees
c664db21f8dd:90424:95103 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90424:95103 [7] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90424:0:95364] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90414:93817 [0] NCCL INFO Connected all trees
c664db21f8dd:90414:93817 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90414:93817 [0] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90414:0:95363] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 95364) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95363) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
c664db21f8dd:90422:94332 [6] NCCL INFO Connected all trees
c664db21f8dd:90422:94332 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90422:94332 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90422:0:95362] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90420:94846 [5] NCCL INFO Connected all trees
c664db21f8dd:90420:94846 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90420:94846 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90420:0:95366] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90417:94589 [3] NCCL INFO Connected all trees
c664db21f8dd:90417:94589 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90417:94589 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
c664db21f8dd:90418:95360 [4] NCCL INFO Connected all trees
c664db21f8dd:90418:95360 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
[c664db21f8dd:90417:0:95367] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90418:95360 [4] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90418:0:95361] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90416:93818 [2] NCCL INFO Connected all trees
c664db21f8dd:90416:93818 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
c664db21f8dd:90416:93818 [2] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
c664db21f8dd:90415:94075 [1] NCCL INFO Connected all trees
c664db21f8dd:90415:94075 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
[c664db21f8dd:90416:0:95365] Caught signal 7 (Bus error: nonexistent physical address)
c664db21f8dd:90415:94075 [1] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer
[c664db21f8dd:90415:0:95368] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 95362) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95366) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95361) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95367) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95368) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
==== backtrace (tid: 95365) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bbc0 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
[2023-01-12 09:21:49,201] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90414
[2023-01-12 09:21:49,255] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90415
[2023-01-12 09:21:50,108] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90416
[2023-01-12 09:21:50,320] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90417
[2023-01-12 09:21:50,320] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90418
[2023-01-12 09:21:50,773] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90420
[2023-01-12 09:21:50,774] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90422
[2023-01-12 09:21:50,774] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 90424
[2023-01-12 09:21:50,774] [ERROR] [launch.py:324:sigkill_handler] ['/usr/bin/python', '-u', 'main_crosslingual_deepspeed.py', '--local_rank=7', '--model_name', 'microsoft/bloom-deepspeed-inference-int8', '--dtype', 'int8'] exits with return code = -7
I get the same problem if I use another model version, such as bigscience/bloom.
However, when I run NCCL_DEBUG=INFO deepspeed --num_gpus 4 main_deepspeed.py --model_name microsoft/bloom-deepspeed-inference-int8 --dtype int8, i.e. change the number of GPUs to 4 (no matter which 4 of my 8 available I use), it works perfectly fine.
Any ideas? I'm stuck here, and two months ago everything worked fine.
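For readers interpreting the launcher's `return code = -7`: Python's `subprocess` module reports a child killed by signal N as return code -N on POSIX systems, so the value can be mapped back to a signal name. The sketch below assumes the launcher surfaces a `subprocess`-style return code; the helper name `signal_name_from_returncode` is hypothetical and not part of DeepSpeed.

```python
import signal
from typing import Optional


def signal_name_from_returncode(returncode: int) -> Optional[str]:
    """Map a subprocess-style return code to a signal name.

    On POSIX, a negative return code -N means the child was terminated
    by signal N. Non-negative codes are ordinary exit statuses, so None
    is returned for them (and for numbers that are not valid signals).
    """
    if returncode >= 0:
        return None
    try:
        return signal.Signals(-returncode).name
    except ValueError:
        # The number does not correspond to a signal on this platform.
        return None


if __name__ == "__main__":
    # Signal numbering is platform-specific; on Linux x86-64, signal 7
    # is SIGBUS, which matches the "Bus error" lines in the log.
    print(signal_name_from_returncode(-7))
```

A `SIGBUS` raised inside NCCL worker threads usually points at a memory-mapping problem in the environment rather than at the model code.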
Expected behavior
The inference script should run without any errors.
ds_report output
console output
<frozen importlib._bootstrap>:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.0a0+936e930
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.6, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.8
System info (please complete the following information):
- OS: Ubuntu 20
- GPU count and types: 8x A100 (80 GB)
- DeepSpeed version: 0.7.6 (I had to downgrade from 0.7.7; otherwise I got a meta tensor error)
- Hugging Face Transformers 4.25.1
- Accelerate 0.15.0
- Python version 3.8.10
Docker context
I'm using the Docker PyTorch image 21.07py.
This thread is a follow up to: https://github.com/microsoft/DeepSpeed/issues/2638
Hello @felifri, I have a similar issue, so I'm leaving a comment describing my situation to follow the progress.
I'm trying to run text-generation inference for GPT-like models. I run the HF Transformers text-generation example code; the only difference is that I cut the prefix from the dataset and use it instead of typing it myself. The code I added for DeepSpeed is shown below (I'm trying to run inference with offload). I'm quite new to DeepSpeed, so any comments would be very helpful.
deepspeed.init_distributed("nccl")
ds_config = {"zero_optimization": {"stage":3, "offload_param":{"device":"cpu", "pin_memory": True}, ~~}}
dschf = HfDeepSpeedConfig(ds_config)
deepspeed.zero.Init()
~~
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
~~
out = ds_engine.module.generate(~~)
Err msg: Caught signal 7 (Bus error: nonexistent physical address)
- OS: Ubuntu 20.04
- GPU: single node, x4 A100 (80GB)
- deepspeed: v0.7.2 (deepspeed docker)
- Hugging Face Transformers: v4.25.0.dev
- Python: v3.8.13
I got the same issue: running with 5 to 8 GPUs does not work, but 4 GPUs works well.
Can you please try using the inference-test.py example from the DeepSpeedExamples repo?
The command to use should be:
deepspeed --num_gpus 8 inference-test.py --name microsoft/bloom-deepspeed-inference-int8 --ds_inference --use_kernel --use_meta_tensor --replace_method=auto --dtype int8 --checkpoint_path <path_to_ckpt if exists>
The test suite makes use of a DSPipeline utility class that handles meta tensor loading and has been tested for microsoft/bloom-deepspeed-inference-int8 and microsoft/bloom-deepspeed-inference-fp16.
I got the same issue with ColossalAI Zero2
This bug was caused by the container's small shared memory size (/dev/shm). Fix it by increasing the shared memory size:
nvidia-docker run -it --rm --shm-size="6g" ...
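To check whether a container is affected before launching, the size of `/dev/shm` can be inspected from Python. This is a minimal sketch, not part of DeepSpeed or NCCL: the helper names are hypothetical, and the 6 GiB threshold just mirrors the `--shm-size="6g"` value above; the amount actually needed may depend on the number of GPUs and NCCL channels. Docker's default shared memory size is 64 MB unless `--shm-size` is set.

```python
import shutil


def shm_usage_gib(path: str = "/dev/shm") -> dict:
    """Return total, used, and free space of the given mount in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total_gib": usage.total / gib,
        "used_gib": usage.used / gib,
        "free_gib": usage.free / gib,
    }


def shm_is_large_enough(min_total_gib: float, path: str = "/dev/shm") -> bool:
    """Check that the mount's total size is at least min_total_gib."""
    return shm_usage_gib(path)["total_gib"] >= min_total_gib


if __name__ == "__main__":
    # 6.0 is an assumed threshold taken from the docker command above.
    stats = shm_usage_gib()
    print(f"/dev/shm total: {stats['total_gib']:.2f} GiB, "
          f"free: {stats['free_gib']:.2f} GiB")
    if not shm_is_large_enough(6.0):
        print("Shared memory looks small; consider a larger --shm-size.")
```

Running this inside the container before `deepspeed --num_gpus 8 ...` gives a quick sanity check; a total far below the chosen threshold would be consistent with the bus error described in this thread.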
This seems environment-related. If @hijkzzz's fix doesn't work for you, please re-open.
same error, amazing