
ib_write_bw works normally but ib_write_bw -R fails

Open sjc2870 opened this issue 3 years ago • 9 comments

This is the output of 'ib_write_bw -a -d mlx5_0 --report_gbits node1', which seems to work fine:

[root@node3 bin]# ib_write_bw -a -d mlx5_0 --report_gbits  node1
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : mlx5_0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x02 QPN 0x0032 PSN 0x5f2841 RKey 0x002440 VAddr 0x007f2e5f64b000
 remote address: LID 0x01 QPN 0x003a PSN 0xc0fd7e RKey 0x002442 VAddr 0x007f443b2f8000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 2          5000           0.042350            0.042185            2.636533
 4          5000           0.084507            0.084457            2.639276
 8          5000             0.17               0.17   		   2.641399
 16         5000             0.34               0.34   		   2.640952
 32         5000             0.68               0.68   		   2.638957
 64         5000             1.35               1.35   		   2.638629
 128        5000             2.71               2.71   		   2.643606
 256        5000             5.42               5.42   		   2.644112
 512        5000             10.78              10.77  		   2.629186
 1024       5000             21.38              21.37  		   2.608802
 2048       5000             42.13              42.09  		   2.568967
 4096       5000             83.97              83.91  		   2.560721
 8192       5000             186.89             149.84 		   2.286319
 16384      5000             195.18             169.98 		   1.296822
 32768      5000             196.21             185.39 		   0.707209
 65536      5000             196.25             190.26 		   0.362886
 131072     5000             196.33             193.93 		   0.184945
 262144     5000             195.49             195.03 		   0.092996
 524288     5000             196.25             196.25 		   0.046789
 1048576    5000             196.48             196.48 		   0.023422
 2097152    5000             196.62             196.59 		   0.011718
 4194304    5000             196.67             196.63 		   0.005860
 8388608    5000             196.63             196.58 		   0.002929
---------------------------------------------------------------------------------------
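As a sanity check on the table above, the average bandwidth column looks consistent with message size times message rate, converted to gigabits per second. A minimal sketch, assuming decimal units (1 Gb = 10^9 bits); that unit convention is inferred from the numbers in this output, not taken from the perftest source, and `bw_gbits` is a hypothetical helper name:

```python
def bw_gbits(msg_bytes: int, msg_rate_mpps: float) -> float:
    """Bandwidth in Gb/s from message size (bytes) and message rate (Mpps).

    Assumes decimal units: 1 Mpps = 1e6 messages/s, 1 Gb = 1e9 bits.
    """
    bits_per_second = msg_bytes * 8 * msg_rate_mpps * 1e6
    return bits_per_second / 1e9


if __name__ == "__main__":
    # Rows copied from the table above: (#bytes, MsgRate[Mpps])
    print(f"{bw_gbits(4096, 2.560721):.2f}")     # prints 83.91
    print(f"{bw_gbits(1048576, 0.023422):.2f}")  # prints 196.48
```

Both values match the "BW average[Gb/sec]" column for those rows, so the non-rdma_cm run is reaching close to the 200 Gb/s HDR line rate at large message sizes.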

But it fails when I add '-R':

[root@node3 bin]# ib_write_bw -a -d mlx5_0 --report_gbits  -R node1
Received 10 times ADDR_ERROR
 Unable to perform rdma_client function
 Unable to init the socket connection

I read the source code and found that the failure is caused by RDMA_CM_EVENT_ADDR_ERROR, but I don't know why.

This is the output of 'lspci -vvv':

[root@node3 bin]#  lspci -vvv | grep Mellanox  -A 65
41:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
	Subsystem: Mellanox Technologies Device 0007
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 1125
	NUMA node: 0
	Region 0: Memory at 2807e000000 (64-bit, prefetchable) [size=32M]
	Expansion ROM at b4400000 [disabled] [size=1M]
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 75.000W
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 512 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 <4us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [48] Vital Product Data
		Product Name: ConnectX-6 VPI adapter card, HDR IB (200Gb/s) and 200GbE, single-port QSFP56                                                                                                          
		Read-only fields:
			[PN] Part number: MCX653105A-HDAT          
			[EC] Engineering changes: AE
			[V2] Vendor specific: MCX653105A-HDAT          
			[SN] Serial number: MT2130T07644   
			[V3] Vendor specific: 92a87ffbcbeaeb118000b8cef6f7f1c0
			[VA] Vendor specific: MLX:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0:MODL=CX653105A      
			[V0] Vendor specific: PCIeGen4 x16 
			[VU] Vendor specific: MT2130T07644MLNXS0D0F0 
			[RV] Reserved: checksum good, 1 byte(s) reserved
		End
	Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=0 offset=00003000
	Capabilities: [c0] Vendor Specific Information: Len=18 <?>
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot-,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO+ CmpltAbrt- UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		AERCap:	First Error Pointer: 04, GenCap+ CGenEn+ ChkCap+ ChkEn+
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1c0 v1] #19
	Capabilities: [320 v1] #27
	Capabilities: [370 v1] #26
	Capabilities: [420 v1] #25
	Kernel driver in use: mlx5_core
	Kernel modules: mlx5_core

42:00.0 Non-Volatile memory controller: Intel Corporation NVMe DC SSD [3DNAND, Beta Rock Controller] (prog-if 02 [NVM Express])
	Subsystem: Intel Corporation Device 8008

Any clue about what happened? I look forward to your reply, thanks!

sjc2870 avatar Jan 17 '22 13:01 sjc2870

Hi, please make sure to use the interface IP instead of the host name when using the -R option.

HassanKhadour avatar Jan 17 '22 19:01 HassanKhadour

Thanks! I had tried using an interface IP before your reply, but it still failed. Your reply reminded me that I need to use the address of the IB network card, not the regular TCP/IP one... Thanks a lot for your reply! Wish you good health and every success!
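For readers hitting the same ADDR_ERROR, the advice in this thread (pass an IP address rather than a host name when using -R) can be sketched as a small pre-flight check. This is only an illustration: `is_ip_literal` and `build_rdma_cm_command` are hypothetical helpers, the address 192.168.100.1 is a made-up example, and the check cannot tell whether the address actually belongs to the IB card's interface; that still has to be verified on the host.

```python
import ipaddress
import shlex
from typing import List


def is_ip_literal(target: str) -> bool:
    """Return True if target parses as an IPv4/IPv6 address, False for host names."""
    try:
        ipaddress.ip_address(target)
        return True
    except ValueError:
        return False


def build_rdma_cm_command(target: str, device: str = "mlx5_0") -> List[str]:
    """Build the ib_write_bw -R command line from this thread for an IP target.

    Raises ValueError for host names, since the thread's advice is to pass
    the interface IP when -R is used.
    """
    if not is_ip_literal(target):
        raise ValueError(f"{target!r} is not an IP address; use the interface IP with -R")
    return ["ib_write_bw", "-a", "-d", device, "--report_gbits", "-R", target]


if __name__ == "__main__":
    # 192.168.100.1 is a hypothetical address of the server's IB interface.
    print(shlex.join(build_rdma_cm_command("192.168.100.1")))
```

Calling `build_rdma_cm_command("node1")` raises a ValueError, which mirrors the failing invocation from the original report.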

sjc2870 avatar Jan 18 '22 02:01 sjc2870

Hi sjc2870, thanks! Wish you the same. Does it still reproduce? Did you solve the issue?

HassanKhadour avatar Nov 10 '22 08:11 HassanKhadour

I failed at this step and I don't know what happened: "Failed to modify QP to RTS" / "Unable to Connect the HCA's through the link"

Taco0220 avatar Aug 24 '23 07:08 Taco0220

Please try using the interface IP rather than the hostname when running with rdma_cm.

HassanKhadour avatar Aug 24 '23 07:08 HassanKhadour

I use the interface IP.

Server error:
 ethernet_read_keys: Couldn't read remote address
 Unable to read to socket/rdma_cm
 Failed to exchange data between server and clients

Client error:
 Failed to modify QP to RTS
 Unable to Connect the HCA's through the link

Taco0220 avatar Aug 24 '23 08:08 Taco0220

Can you please share the setup info (OS, cards, etc.) so I can try to reproduce the issue?

HassanKhadour avatar Aug 24 '23 08:08 HassanKhadour

Sorry, my OS is '6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC'. The network card is an Intel Corporation Ethernet Connection X722.

Taco0220 avatar Aug 24 '23 09:08 Taco0220