
Broadcast hangs on cyt_rdma

Open lawirz opened this issue 1 year ago • 8 comments

I have observed similar behaviour with other collectives, but so far I have only reproduced it with broadcast, so the title may be misleading. I will add comments about similar behaviour with other collectives here later.

Calling Broadcast with 4 MB (1048576 elements) hangs on the second rank (rank 1), while rank 0 runs to completion.
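As a quick sanity check on the sizes involved, here is a small sketch using the values from the logs below. The 4-byte element size is an assumption (the logs do not state the data type); it is merely consistent with `count:1048576` and `rxbuf_size:4194304`.

```python
# Values copied from the logs in this report.
count = 1_048_576        # "-c" argument, printed as "count:" in the log
rxbuf_size = 4_194_304   # "rxbuf_size:" in the log
seg_size = 4_194_304     # "seg_size:" in the log
page_size = 2_097_152    # "page_size:" in the CoyoteBuffer lines

# Assumption: 4 bytes per element (e.g. float32); the dtype is not in the logs.
BYTES_PER_ELEMENT = 4

message_bytes = count * BYTES_PER_ELEMENT
n_pages = -(-message_bytes // page_size)  # ceiling division

print(message_bytes)                            # size of one broadcast in bytes
print(message_bytes == rxbuf_size == seg_size)  # fills one buffer / one segment exactly?
print(n_pages)                                  # huge pages needed for the user buffer
```

Under that assumption the message is exactly one segment and one rendezvous buffer in size, spanning two 2 MB pages, which matches the `n_pages:2` lines in the logs.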

Rank 0

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:93.336
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
host measured durationUs:252146

ACCL base functionality test completed successfully!

-- STATISTICS - ID: 0
-----------------------------------------------
          Read command FIFO used: 	0
         Write command FIFO used: 	0
                 Host reads sent: 	1
                Host writes sent: 	0
                 Card reads sent: 	0
                Card writes sent: 	0
                 Sync reads sent: 	5
                Sync writes sent: 	0
                     Page faults: 	0


 -- NET STATS QSFP0

RX pkgs: 738
TX pkgs: 1030
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 654
ROCE TX pkgs: 1028
IBV RX pkgs: 646
IBV TX pkgs: 66566
PSN drop cnt: 0
Retrans cnt: 384
TCP session cnt: 0
STRM down: 0


stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 92386
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fe00000, Size: 64
calling offload: 7fc95fe00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fc00000, Size: 64
calling offload: 7fc95fc00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f800000, Size: 4194304
calling offload: 7fc95f800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f400000, Size: 4194304
calling offload: 7fc95f400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f000000, Size: 4194304
calling offload: 7fc95f000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 0 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fc95fe00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7fc95fc00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95ec00000, Size: 4194304
Broadcasting data from 0...
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95ec00000
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fc95fe00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7fc95fc00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fe00000
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fc00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f800000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f400000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f000000

Rank 1

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 1] rank 1 size 2 alveo-u55c-08.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 1 receiving remote QP from remote rank 0
Local rank 1 sending local QP to remote rank 0
Queue Pair: id: 0
Local Queue: local: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
Remote Queue: remote: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
rank: 1 FPGA IP: afd4a60
Rendezvous Protocol
sw nop time [us]:86.834
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier

stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 90744
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-08.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5e00000, Size: 64
calling offload: 7f2da5e00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5c00000, Size: 64
calling offload: 7f2da5c00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5800000, Size: 4194304
calling offload: 7f2da5800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5400000, Size: 4194304
calling offload: 7f2da5400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5000000, Size: 4194304
calling offload: 7f2da5000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 1 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 1 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f2da5e00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7f2da5c00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da4c00000, Size: 4194304
Getting broadcast data from 0...

Running smaller Broadcast operations works, even when they are above the rendezvous threshold. When I ran with 128 elements (which is above the threshold), I broke a machine, though (subsequent bitstream flashing failed), but this might just have been bad luck.

The other collective I have had issues with is allreduce. I get hangs there too, but this might be completely unrelated.

Generally, the errors seem to occur at certain sizes or after a certain number of repetitions. It might just be a delay after which the machine hangs, as I got hangs in instances where no ACCL collective was even running. This happened in conjunction with allreduce, and I have trouble reproducing it.
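To tell a size-dependent hang apart from a time-dependent one, a sweep with a per-run timeout that also records elapsed time could help. The following is only a sketch: `run_with_timeout`, `sweep` and `make_cmd` are hypothetical helper names, and the stand-in command just sleeps; the real launch line for the test is not shown here.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (status, elapsed_seconds).

    status is "ok", "failed" (non-zero exit code) or "timeout" (treated as a hang).
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, timeout=timeout_s,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
        status = "ok" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising.
        status = "timeout"
    return status, time.monotonic() - start


def sweep(sizes, make_cmd, timeout_s):
    """Try each size in order and stop at the first run that does not pass."""
    results = []
    for n in sizes:
        status, elapsed = run_with_timeout(make_cmd(n), timeout_s)
        results.append((n, status, round(elapsed, 1)))
        if status != "ok":
            break
    return results


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere: sleeps n milliseconds.
    # On the cluster, make_cmd would build the real launch line instead.
    def fake_cmd(n):
        return [sys.executable, "-c", f"import time; time.sleep({n} / 1000)"]

    print(sweep([128, 256, 4096], fake_cmd, timeout_s=1.0))
```

Note that the timeout only kills the direct child process; ranks started through a launcher may need separate cleanup, and a hung FPGA will not recover just because the host process was killed.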

I'm running it on the 200-allreduce-hangs... branch, but I saw the same behaviour on the 196 merge commit. I'm fairly confident everything worked before the merge of the 196-fix, but I can try to verify it. I was certainly able to run almost all collectives on HW at some point before I picked up the 196 issue merge.

Everything works in the simulator, in a variety of scenarios.

lawirz avatar Jun 19 '24 11:06 lawirz

I can confirm that I observe similar behaviour when running Allreduce in isolation. I tried to run Allreduce with a size of just 2. The first run succeeded. On the second run, the machine started hanging (I can't even reprogram it anymore).

lawirz avatar Jun 19 '24 13:06 lawirz

I can also confirm that the issues are not present on the commit before the 196 merge. Merge pull request

lawirz avatar Jun 19 '24 15:06 lawirz

You linked to #194. Do you mean that, or the PR that closed issue #196?

quetric avatar Jun 24 '24 12:06 quetric

I mean to say that the issues were probably introduced in the 196-fix. The commit right before it is the 194 merge (01f49d2), on which the issue is not present.

lawirz avatar Jun 24 '24 12:06 lawirz

Can you attach your code here? This doesn't look like it's from any of our tests.

quetric avatar Jun 24 '24 12:06 quetric

It's the test/host/Coyote/runscripts/run.sh with

TEST_MODE=(5) 
N_ELEMENTS=(1048576) # 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576

lawirz avatar Jun 24 '24 12:06 lawirz

Does this same test work against the emulator?

quetric avatar Jun 24 '24 12:06 quetric

I didn't try the equivalent as an isolated test case. But the emulator works with the ProcessGroup with different sizes and repetitions, while in hardware it shows behaviour like this very quickly.

lawirz avatar Jun 24 '24 12:06 lawirz