X18 and X24 disks frequently reset with SAS3008 HBAs under heavy write load
I have a bunch (11 each) of ST24000NM000C and ST16000NM001G drives that cause major issues with my SAS3008-based HBA (the onboard HBA on the Supermicro H12SSL-CT, but the same thing happens on a regular 9300-8i). Specifically, the HBA hits some failure mode under heavy write loads to the new X24s and the driver triggers a whole-HBA reset. Heavy reads do not seem to be affected.
The X18s' default EPC settings differ from the X24s': the X18s seem to have Idle_A set to 1 and Idle_B set to 1200, while the X24 firmware only has Idle_A set to 1. The first time I saw this occur, I disabled EPC on the new X24s with --EPCfeature disable and thought it was resolved, but the next time there was a fairly sustained write load it happened again.
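For reference, the disable/verify steps were roughly the following (the device handle here is just illustrative):

# disable EPC on one of the X24s, then confirm the resulting timer state (handle is illustrative)
openSeaChest_PowerControl -d /dev/sg18 --EPCfeature disable
openSeaChest_PowerControl -d /dev/sg18 --showEPCSettings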
I didn't have this issue when it was purely the X18 disks on this adapter. It was only once the X24s were added to the mix that I saw this occur. It also does not occur with HGST/WD disks.
All X18 disks are on SN02, except one RMA refurbed ST16000NM000J on SN04. All X24 disks are on SN02. The SAS3008 HBA is on 16.00.14.00. It is actively cooled and temp is monitored and not overheating. Disks are all attached on a Supermicro 846 SAS3 backplane/LSI expander on 66.16.11.00. Kernel is 6.10.11-amd64, current Debian testing/trixie.
Here's dmesg during a heavy write load triggering the problem:
[Wed Oct 16 01:13:02 2024] mpt3sas_cm0 fault info from func: mpt3sas_base_make_ioc_ready
[Wed Oct 16 01:13:02 2024] mpt3sas_cm0: fault_state(0x5854)!
[Wed Oct 16 01:13:02 2024] mpt3sas_cm0: sending diag reset !!
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: diag reset: SUCCESS
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: In func: _ctl_do_mpt_command
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: Command terminated due to Host Reset
[Wed Oct 16 01:13:03 2024] mf:
[Wed Oct 16 01:13:03 2024] 0000000b
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000018
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000008
[Wed Oct 16 01:13:03 2024]
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 0000000a
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 02000000
[Wed Oct 16 01:13:03 2024]
[Wed Oct 16 01:13:03 2024] 00000025
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] 00000000
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: _base_display_fwpkg_version: complete
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: overriding NVDATA EEDPTagMode setting
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: LSISAS3008: FWVersion(16.00.14.00), ChipRevision(0x02)
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[Wed Oct 16 01:13:03 2024] mpt3sas_cm0: sending port enable !!
[Wed Oct 16 01:13:10 2024] mpt3sas_cm0: port enable: SUCCESS
[Wed Oct 16 01:13:10 2024] mpt3sas_cm0: search for end-devices: start
[Wed Oct 16 01:13:10 2024] scsi target0:0:0: handle(0x000a), sas_addr(0x5003048017ab9940)
[Wed Oct 16 01:13:10 2024] scsi target0:0:0: enclosure logical id(0x5003048017ab997f), slot(0)
[Wed Oct 16 01:13:10 2024] scsi target0:0:1: handle(0x000b), sas_addr(0x5003048017ab9941)
[Wed Oct 16 01:13:10 2024] scsi target0:0:1: enclosure logical id(0x5003048017ab997f), slot(1)
[Wed Oct 16 01:13:10 2024] scsi target0:0:2: handle(0x000c), sas_addr(0x5003048017ab9942)
[Wed Oct 16 01:13:10 2024] scsi target0:0:2: enclosure logical id(0x5003048017ab997f), slot(2)
[Wed Oct 16 01:13:10 2024] scsi target0:0:3: handle(0x000d), sas_addr(0x5003048017ab9943)
[Wed Oct 16 01:13:10 2024] scsi target0:0:3: enclosure logical id(0x5003048017ab997f), slot(3)
[Wed Oct 16 01:13:10 2024] scsi target0:0:4: handle(0x000e), sas_addr(0x5003048017ab9944)
[Wed Oct 16 01:13:10 2024] scsi target0:0:4: enclosure logical id(0x5003048017ab997f), slot(4)
[Wed Oct 16 01:13:10 2024] scsi target0:0:5: handle(0x000f), sas_addr(0x5003048017ab9945)
[Wed Oct 16 01:13:10 2024] scsi target0:0:5: enclosure logical id(0x5003048017ab997f), slot(5)
[Wed Oct 16 01:13:10 2024] scsi target0:0:6: handle(0x0010), sas_addr(0x5003048017ab9946)
[Wed Oct 16 01:13:10 2024] scsi target0:0:6: enclosure logical id(0x5003048017ab997f), slot(6)
[Wed Oct 16 01:13:10 2024] scsi target0:0:7: handle(0x0011), sas_addr(0x5003048017ab9947)
[Wed Oct 16 01:13:10 2024] scsi target0:0:7: enclosure logical id(0x5003048017ab997f), slot(7)
[Wed Oct 16 01:13:10 2024] scsi target0:0:8: handle(0x0012), sas_addr(0x5003048017ab9948)
[Wed Oct 16 01:13:10 2024] scsi target0:0:8: enclosure logical id(0x5003048017ab997f), slot(8)
[Wed Oct 16 01:13:10 2024] scsi target0:0:9: handle(0x0013), sas_addr(0x5003048017ab9949)
[Wed Oct 16 01:13:10 2024] scsi target0:0:9: enclosure logical id(0x5003048017ab997f), slot(9)
[Wed Oct 16 01:13:10 2024] scsi target0:0:10: handle(0x0014), sas_addr(0x5003048017ab994a)
[Wed Oct 16 01:13:10 2024] scsi target0:0:10: enclosure logical id(0x5003048017ab997f), slot(10)
[Wed Oct 16 01:13:10 2024] scsi target0:0:11: handle(0x0015), sas_addr(0x5003048017ab994b)
[Wed Oct 16 01:13:10 2024] scsi target0:0:11: enclosure logical id(0x5003048017ab997f), slot(11)
[Wed Oct 16 01:13:10 2024] scsi target0:0:12: handle(0x0016), sas_addr(0x5003048017ab995c)
[Wed Oct 16 01:13:10 2024] scsi target0:0:12: enclosure logical id(0x5003048017ab997f), slot(12)
[Wed Oct 16 01:13:10 2024] scsi target0:0:13: handle(0x0017), sas_addr(0x5003048017ab995d)
[Wed Oct 16 01:13:10 2024] scsi target0:0:13: enclosure logical id(0x5003048017ab997f), slot(13)
[Wed Oct 16 01:13:11 2024] scsi target0:0:14: handle(0x0018), sas_addr(0x5003048017ab995e)
[Wed Oct 16 01:13:11 2024] scsi target0:0:14: enclosure logical id(0x5003048017ab997f), slot(14)
[Wed Oct 16 01:13:11 2024] scsi target0:0:15: handle(0x0019), sas_addr(0x5003048017ab995f)
[Wed Oct 16 01:13:11 2024] scsi target0:0:15: enclosure logical id(0x5003048017ab997f), slot(15)
[Wed Oct 16 01:13:11 2024] scsi target0:0:16: handle(0x001a), sas_addr(0x5003048017ab9960)
[Wed Oct 16 01:13:11 2024] scsi target0:0:16: enclosure logical id(0x5003048017ab997f), slot(16)
[Wed Oct 16 01:13:11 2024] scsi target0:0:17: handle(0x001b), sas_addr(0x5003048017ab9961)
[Wed Oct 16 01:13:11 2024] scsi target0:0:17: enclosure logical id(0x5003048017ab997f), slot(17)
[Wed Oct 16 01:13:11 2024] scsi target0:0:18: handle(0x001c), sas_addr(0x5003048017ab9963)
[Wed Oct 16 01:13:11 2024] scsi target0:0:18: enclosure logical id(0x5003048017ab997f), slot(19)
[Wed Oct 16 01:13:11 2024] scsi target0:0:19: handle(0x001d), sas_addr(0x5003048017ab9964)
[Wed Oct 16 01:13:11 2024] scsi target0:0:19: enclosure logical id(0x5003048017ab997f), slot(20)
[Wed Oct 16 01:13:11 2024] scsi target0:0:20: handle(0x001e), sas_addr(0x5003048017ab9966)
[Wed Oct 16 01:13:11 2024] scsi target0:0:20: enclosure logical id(0x5003048017ab997f), slot(22)
[Wed Oct 16 01:13:11 2024] scsi target0:0:21: handle(0x001f), sas_addr(0x5003048017ab9967)
[Wed Oct 16 01:13:11 2024] scsi target0:0:21: enclosure logical id(0x5003048017ab997f), slot(23)
[Wed Oct 16 01:13:11 2024] scsi target0:0:22: handle(0x0020), sas_addr(0x5003048017ab997d)
[Wed Oct 16 01:13:11 2024] scsi target0:0:22: enclosure logical id(0x5003048017ab997f), slot(24)
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for end-devices: complete
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for end-devices: start
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for PCIe end-devices: complete
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for expanders: start
[Wed Oct 16 01:13:11 2024] expander present: handle(0x0009), sas_addr(0x5003048017ab997f), port:255
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: search for expanders: complete
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
[Wed Oct 16 01:13:11 2024] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[Wed Oct 16 01:13:11 2024] sd 0:0:0:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:4:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:9:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:1:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:11:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:3:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:17:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:6:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:7:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:8:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:10:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:12:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:13:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:14:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:15:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:16:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:18:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:19:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:20:0: Power-on or device reset occurred
[Wed Oct 16 01:13:11 2024] sd 0:0:2:0: device_block, handle(0x000c)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: removing unresponding devices: start
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: removing unresponding devices: end-devices
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: Removing unresponding devices: pcie end-devices
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: removing unresponding devices: expanders
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: removing unresponding devices: complete
[Wed Oct 16 01:13:12 2024] sd 0:0:2:0: device_unblock and setting to running, handle(0x000c)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: scan devices: start
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: scan devices: expanders start
[Wed Oct 16 01:13:12 2024] sd 0:0:5:0: attempting task abort!scmd(0x00000000bfccca11), outstanding for 2948 ms & timeout 1000 ms
[Wed Oct 16 01:13:12 2024] sd 0:0:5:0: [sde] tag#187 CDB: ATA command pass through(16) 85 08 0e 00 d5 00 01 00 e0 00 4f 00 c2 00 b0 00
[Wed Oct 16 01:13:12 2024] scsi target0:0:5: handle(0x000f), sas_address(0x5003048017ab9945), phy(5)
[Wed Oct 16 01:13:12 2024] scsi target0:0:5: enclosure logical id(0x5003048017ab997f), slot(5)
[Wed Oct 16 01:13:12 2024] scsi target0:0:5: enclosure level(0x0000), connector name( )
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: scan devices: expanders complete
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: scan devices: end devices start
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: break from end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: scan devices: end devices complete
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: scan devices: pcie end devices start
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: log_info(0x3003011d): originator(IOP), code(0x03), sub_code(0x011d)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0021), loginfo(0x3003011d)
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: pcie devices: pcie end devices complete
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: scan devices: complete
[Wed Oct 16 01:13:12 2024] mpt3sas_cm0: device is not present handle(0x000c), flags!!!
[Wed Oct 16 01:13:12 2024] sd 0:0:5:0: task abort: SUCCESS scmd(0x00000000bfccca11)
[Wed Oct 16 01:13:20 2024] sd 0:0:2:0: Power-on or device reset occurred
[Wed Oct 16 01:13:20 2024] sd 0:0:5:0: Power-on or device reset occurred
[Wed Oct 16 01:13:20 2024] sd 0:0:21:0: Power-on or device reset occurred
I contacted Seagate support and, uh, they told me to install some Windows-only software to monitor for firmware updates and otherwise didn't know how to respond to anything technical. So I'm hoping this info might be useful to someone through you guys.
Hi @putnam,
Sorry you are having issues with your system. To make sure I am following your issue, this is what you have seen happening:
- Installed new drives, started seeing resets during heavy loads
- Disabled EPC and seemed to resolve it
- Resets are back
Is this correct?
Per the standards, disabling EPC should persist across resets and power cycles. Are you seeing the EPC feature become enabled again even though you have not sent an enable?
As for firmware updates, sometimes those can help (on both the HBA side and the drive side). The Seagate support site has a firmware update finder: scroll to the bottom of this page and you can enter a drive's serial number to check for newer firmware. You don't need the Windows-only tool (it basically scans and opens that same webpage with the SN already filled in). I don't know if that will resolve the issue, but you can try it.
I am asking around to see if any of the customer support engineers have run into this as well, but I have not heard anything yet.
Thanks so much for the response. I edited my original ticket a lot, so I think you're responding to the initial version. I realized, looking at bash history and the state of the disks, that:
- --EPCfeature disable did actually persist. You're right.
- Disabling EPC didn't resolve the issues with the X24 disks after all, because I didn't truly load them with writes. Once they were loaded with writes again the same behavior came back.
I'm sure this is now outside the scope of this repo, but you guys have been so useful in the past when I've reported possible firmware bugs, so maybe it's useful to have shared it here anyway. I'm not an enterprise customer, just an end user, so it's hard to get a line to someone with inside engineering connections.
I can repro more consistently now by just copying a lot of data to the disks. I have found very little info on these particular 24TB models, since I understand they're technically binned/refurbed X24 HAMR disks. It may well be an issue with the LSI/Broadcom firmware or even mpt3sas, but again it doesn't repro on my 60+ HGST/WD disks or on the 16TB Exos disks on their own.
Since we're almost certainly outside the scope of openSeaChest here, feel free to close this, but if it's something you're open to pursuing with more debug data and info, I can share it here or privately over email.
Regarding firmware: there's no update available for these drives on the end-user portal yet.
@putnam,
I did pass this issue along to some people internally to see if they've seen similar problems before with these drives and hardware, but I have not heard anything yet.
If you dump the SATA phy event counters, are you seeing those increase at all?
openSeaChest_Info -d <handle> --showPhyEvents
If these are increasing (not just the reset counter, but others), it can point towards a cabling issue.
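If it is easier to compare before and after a heavy-write run, a quick snapshot over the drive handles works too; this is just a sketch with placeholder handles:

# snapshot phy event counters for a set of drives into a timestamped file (placeholder handles)
for d in /dev/sg1 /dev/sg2 /dev/sg3; do
    openSeaChest_Info -d "$d" --showPhyEvents
done > phyEvents_$(date +%Y%m%dT%H%M%S).txt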
I'll see if there is anything else I can think of trying that might also help debug this.
Thanks for the reply! OK, here are the PHY counters from openSeaChest_Info --showPhyEvents for the different Seagate disks hanging off this backplane/controller. Do you know whether these are a rolling window or lifetime counts? Back in September when I first got these disks, I replaced the internal SAS cables due to CRC errors during the initial ZFS resilvering. Changing firmware, cables, and workloads all at once is a lot of variable juggling and I don't want to get it wrong here, but when I changed the internal cables, the resilver continued without any issues or drops and the CRC errors went away at that time. I also have multiple brand-new 3M and Amphenol cables on the shelf and can swap them in to eliminate the cable variable one more time, if you like. It wouldn't be the first, second, or third time that cabling has come up. After 10+ years of dealing with SAS2/SAS3, cables feel like an evergreen issue that everyone faces.
Anyway, the resets I see now are specifically when ZFS is copying a large amount of data to the pool and is lighting up the vdevs made up of Seagate devices for a sustained amount of time. Eventually, you see the same message about the HBA resetting with the same fault code in mpt3sas. I did some digging in the mpt3sas driver hoping to find some bitflags or something to identify the fault code but it looks to be internal/proprietary to Broadcom/LSI.
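For what it's worth, my "digging" was mostly just grepping the kernel source for the fault code and the fault-handling path, something like this (assuming a checked-out kernel tree):

# search the mpt3sas driver for the fault code and the fault_state reporting path
grep -rn "0x5854" drivers/scsi/mpt3sas/
grep -rn "fault_state" drivers/scsi/mpt3sas/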
24TB X24 Disks (Newer)
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA04RL6 - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 16 H2D FISes sent due to COMRESET
1 2 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 2 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 2 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA09AWD - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 11 H2D FISes sent due to COMRESET
1 2 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 2 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 2 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA09KJB - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 12 H2D FISes sent due to COMRESET
1 3 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 3 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 3 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA09QQH - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 13 H2D FISes sent due to COMRESET
1 0 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 0 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 0 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA09RSQ - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 16 H2D FISes sent due to COMRESET
1 3 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 3 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 3 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0BKFL - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 10 H2D FISes sent due to COMRESET
1 2 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 2 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 2 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160234 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0C241 - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 17 H2D FISes sent due to COMRESET
1 2 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 2 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 2 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160235 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0CWPX - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 13 H2D FISes sent due to COMRESET
1 4 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 4 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 4 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160235 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0D2EL - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 12 H2D FISes sent due to COMRESET
1 0 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 0 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 0 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160235 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0EWXY - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 13 H2D FISes sent due to COMRESET
1 5 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 5 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
11 5 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160235 User: root
==========================================================================================
- ST24000NM000C-3WD103 - ZXA0GJGN - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 19 H2D FISes sent due to COMRESET
1 1 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 1 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 1 R_ERR response for H2D non-data FIS
11 2 CRC errors withing H2D FIS
13 0 Non-CRC errors within H2D FIS
16TB X18 Disks (Older, pre-existing without resets)
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM000J-2TW103 - ZR5ECA55 - SN04 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 3 H2D FISes sent due to COMRESET
1 1 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 1 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL20CAJ9 - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 4 H2D FISes sent due to COMRESET
1 0 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 0 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL20D3TL - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 7 H2D FISes sent due to COMRESET
1 1 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 1 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL20YT4M - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 5 H2D FISes sent due to COMRESET
1 1 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 1 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160554 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL213QN0 - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 4 H2D FISes sent due to COMRESET
1 0 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 0 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21909L - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 7 H2D FISes sent due to COMRESET
1 2 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 2 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21AHY7 - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 12 H2D FISes sent due to COMRESET
1 1 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 1 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 1 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21L84X - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 16 H2D FISes sent due to COMRESET
1 2 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 2 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21L97Y - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 4 H2D FISes sent due to COMRESET
1 1 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 1 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21LGDW - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 4 H2D FISes sent due to COMRESET
1 1 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 1 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
==========================================================================================
openSeaChest_Info - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2024 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Info Version: 2.7.0-8_0_1 X86_64
Build Date: Sep 25 2024
Today: 20241017T160555 User: root
==========================================================================================
- ST16000NM001G-2KK103 - ZL21LYZC - SN02 - ATA
====SATA Phy Event Counters====
V = Vendor Unique event tracker
M = Counter maximum value reached
D2H = Device to Host
H2D = Host to Device
ID Value Description
10 6 H2D FISes sent due to COMRESET
1 1 Command failed with iCRC error
3 0 R_ERR response for D2H data FIS
4 1 R_ERR response for H2D data FIS
6 0 R_ERR response for D2H non-data FIS
7 0 R_ERR response for H2D non-data FIS
Do you know whether this is a rolling window or lifetime?
For this page it continues counting until you reset the counters on the page. I don't remember if we put that in as an option in openSeaChest yet. I will have to review the code.
The reason I mentioned the CRC errors is due to some of my own past experience trying to troubleshoot some issues other customers have seen.
I have also had some long conversations with one of the Seagate engineers who works at the phy level, with the goal of figuring out a way to write a test for detecting a bad cable. It's not an easy task 😆 but we did come up with some ideas, including using these logs. I have not had time to implement it yet, but it will be an expanded version of the openSeaChest_GenericTests --bufferTest routine I already have... sometimes that will detect an error, but it runs for far too short a time to be reliable right now.
One thing I learned from him was that the faster the interface is running (6Gb/s vs 3Gb/s), the sooner you notice signaling issues. The most common symptom is the CRC counters increasing, and that is often due to a cabling problem... not always, but in your case I suspect it is, since it's happening on multiple different drives, even drives that were not previously having an issue. It's possible that these new drives have slightly different phy behavior that managed to bring this out. There are a couple of different issues that can happen on the bus that HBAs and drives both try to mitigate (such as signal reflections), but that mitigation can only go so far before an error is no longer correctable. There are limits to how many signal-level issues can be worked around, and with these new drives maybe an existing problem that used to be manageable no longer is (just guessing here).
Another thing that can happen (and I have experienced it myself) is that similar symptoms appear as the backplane connectors wear out from plugging and unplugging drives. Eventually all connectors fail, but as you approach the rated insertion count you can start to see these kinds of issues too.
I don't know if any of these will solve the issue, but you can try these things:
- Unplug the drives and plug them back in (sometimes this reseats the connector better and may mitigate this issue)
- If you have backplanes and can replace them easily, maybe give it a try
- Replace the cables in the system
openSeaChest_Configure also has an option to set the phy speed lower, which you can try, though it may limit your maximum sequential read/write on more modern drives. DO NOT go below 3.0Gb/s though: I found out that some modern SAS/SATA controllers no longer support 1.5Gb/s, and once a drive is set that low you will have to track down another HBA that does support that low speed to restore it to a higher one. I found this in the HBA documentation, so check what your HBA supports first.
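If I remember the syntax correctly it is something like the line below, but please check openSeaChest_Configure --help first to confirm the value-to-speed mapping (the handle and value here are only an example):

# limit the negotiated phy speed on one drive (example value; confirm the mapping in --help before running)
openSeaChest_Configure -d /dev/sg18 --phySpeed 2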
One last thing: checking for updates to the HBA firmware may also help. I have seen firmware fixes resolve odd behavior, including some odd phy issues on past Broadcom HBAs, but I don't know if that applies to this specific case.
Let me know if this helps. I'll see if I can talk to that signal engineer I mentioned about this to see if he has any other ideas.
Thanks. Will go over and try. Regarding the HBA, it's a pretty common SAS3008 HBA and on latest firmware (16.00.14.00). The backplane hasn't had a ton of insertion cycles, but reseating can't hurt. I will swap to a new-in-bag Amphenol cable set + reseat disks and see if I can repro again and report back.
@putnam,
Did swapping cables make a difference in your case?
Another idea is to see if the HBA's BIOS/UEFI settings allow disabling link power management. I am not sure whether your HBA supports that, but I had a very similar-sounding issue reported to me, and in that case disabling link power management in the BIOS/UEFI for the AHCI controller stopped the resets. I'm not sure that will be the solution here, but it is something else you can check.
No, unfortunately it has not, after letting it cook for a while. I swapped the cables and left town (I'm still away until next week), but I'm still seeing the same behavior under even a modest write load. And right now it only seems to affect the newer X24 disks.
Link power management (ASPM) is disabled I think. You can see the state of it in lspci -vv:
---snip---
41:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
DeviceName: LSI 3008 SAS
Subsystem: Super Micro Computer Inc AOC-S3008L-L8e
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 353
IOMMU group: 26
Region 0: I/O ports at 7000 [size=256]
Region 1: Memory at b1840000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at b1800000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at b1700000 [disabled] [size=1M]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, IntMsgNum 0
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W TEE-IO-
DevCtl: CorrErr- NonFatalErr- FatalErr+ UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr+ FatalErr- UnsupReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM not supported
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x8
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
AtomicOpsCtl: ReqEn-
IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
---snip---
Note under LnkCtl it says ASPM Disabled.
As I write this I'm taking a minute on vacation to prop up the array before it goes totally offline. This has happened before, but this time one X24 disk got knocked offline hard enough that it hasn't come back; it will need a physical unplug/replug or a full server power cycle to bring it back up.
As far as drive power, I reconfirmed their EPC settings across the board:
root@dwight:~/seagate/openseachest_exes# ./openSeaChest_PowerControl --showEPCSettings -d /dev/sdb
==========================================================================================
openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_PowerControl Version: 3.0.2-2_2_3 X86_64
Build Date: Jun 21 2021
Today: Fri Nov 22 08:13:27 2024 User: root
==========================================================================================
/dev/sg1 - ST24000NM000C-3WD103 - XXXXXXXX - ATA
.
===EPC Settings===
* = timer is enabled
C column = Changeable
S column = Savable
All times are in 100 milliseconds
Name Current Timer Default Timer Saved Timer Recovery Time C S
Idle A 0 *1 *1 1 Y Y
Idle B 0 1200 1200 4 Y Y
Idle C 0 6000 6000 20 Y Y
Standby Z 0 9000 9000 110 Y Y
I'm kind of at a loss as to what to do with this right now besides swapping in another vendor. There must be something going on between the drive firmware and the controller, but I don't know where else to look.
@putnam We're asking about "SATA Link power management" (putting the SATA Phy connection to sleep), rather than ASPM (putting the PCI-e link to sleep).
I think you can see if this is enabled by running:
openSeaChest_SMART -d /dev/sdb --SATInfo
and looking for whether SATA Device Initiated Power Management has [Enabled] at the end of the same line.
Ah, sorry. Here is the output on a sample X24 disk. It doesn't have [Enabled] at the end:
==========================================================================================
openSeaChest_SMART - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_SMART Version: 2.0.1-2_2_3 X86_64
Build Date: Jun 21 2021
Today: Sat Nov 23 06:50:48 2024 User: root
==========================================================================================
/dev/sg0 - ST24000NM000C-3WD103 - ZXXXXXXX - ATA
SCSI Translator Reported Information:
Vendor ID: ATA
Model Number: ST24000NM000C-3W
Serial Number: ZXXXXXXX
Firmware Revision: SN02
SAT Vendor ID: LSI
SAT Product ID: LSI SATL
SAT Product Rev: 0008
World Wide Name: XXXXXXX
Drive Capacity (TB/TiB): 24.00/21.83
Temperature Data:
Current Temperature (C): 34
Highest Temperature (C): Not Reported
Lowest Temperature (C): Not Reported
Power On Time: 80 days 3 hours
Power On Hours: 1923.00
MaxLBA: 46875541503
Native MaxLBA: Not Reported
Logical Sector Size (B): 512
Physical Sector Size (B): 4096
Sector Alignment: 0
Rotation Rate (RPM): 7200
Form Factor: 3.5"
Last DST information:
DST has never been run
Long Drive Self Test Time: 18 hours 13 minutes
Interface speed:
Not Reported
Annualized Workload Rate (TB/yr): Not Reported
Total Bytes Read (B): Not Reported
Total Bytes Written (B): Not Reported
Encryption Support: Not Supported
Cache Size (MiB): Not Reported
Read Look-Ahead: Enabled
Write Cache: Enabled
SMART Status: Good
ATA Security Information: Supported
Firmware Download Support: Full, Segmented
Specifications Supported:
SPC-4
SAM-4
SAT-3
SPC-4
SBC-3
SAS
ATA8-ACS
ZBC
Features Supported:
SAT
ATA Security
Self Test
Automatic Write Reassignment [Enabled]
EPC [Enabled]
Informational Exceptions [Mode 6]
Adapter Information:
Vendor ID: 1000h
Product ID: 0097h
Revision: 0002h
ATA Reported Information:
Model Number: ST24000NM000C-3WD103
Serial Number: ZXXXXXX
Firmware Revision: SN02
World Wide Name: XXXXXXXX
Drive Capacity (TB/TiB): 24.00/21.83
Native Drive Capacity (TB/TiB): 24.00/21.83
Temperature Data:
Current Temperature (C): 34
Highest Temperature (C): 51
Lowest Temperature (C): 29
Power On Time: 80 days 3 hours
Power On Hours: 1923.00
MaxLBA: 46875541503
Native MaxLBA: 46875541503
Logical Sector Size (B): 512
Physical Sector Size (B): 4096
Sector Alignment: 0
Rotation Rate (RPM): 7200
Form Factor: 3.5"
Last DST information:
DST has never been run
Long Drive Self Test Time: 1 day 15 hours 1 minute
Interface speed:
Max Speed (Gb/s): 6.0
Negotiated Speed (Gb/s): 6.0
Annualized Workload Rate (TB/yr): 309.50
Total Bytes Read (TB): 56.03
Total Bytes Written (TB): 11.91
Encryption Support: Not Supported
Cache Size (MiB): 512.00
Read Look-Ahead: Enabled
Write Cache: Enabled
Low Current Spinup: Disabled
SMART Status: Unknown or Not Supported
ATA Security Information: Supported
Firmware Download Support: Full, Segmented, Deferred
Specifications Supported:
ACS-5
ACS-4
ACS-3
ACS-2
ATA8-ACS
ATA/ATAPI-7
ATA/ATAPI-6
ATA/ATAPI-5
SATA 3.3
SATA 3.2
SATA 3.1
SATA 3.0
SATA 2.6
SATA 2.5
SATA II: Extensions
SATA 1.0a
ATA8-AST
Features Supported:
Sanitize
SATA NCQ
SATA Software Settings Preservation [Enabled]
SATA Device Initiated Power Management
Power Management
Security
SMART [Enabled]
48bit Address
PUIS
GPL
Streaming
SMART Self-Test
SMART Error Logging
Write-Read-Verify
DSN
AMAC
EPC
Sense Data Reporting
SCT Write Same
SCT Error Recovery Control
SCT Feature Control
SCT Data Tables
Host Logging
Set Sector Configuration
Storage Element Depopulation
Seagate In Drive Diagnostics (IDD)
Adapter Information:
Vendor ID: 1000h
Product ID: 0097h
Revision: 0002h
@putnam,
Thanks for sharing that additional information.
There are 2 parts to power management of the phy on both SATA and SAS: Host initiated, and Device initiated (sometimes abbreviated HIPM and DIPM).
Please note I DO NOT recommend using openSeaChest_PowerControl to enable device-initiated power management. If your system has not already enabled it on its own, enabling it may make the drive inaccessible. That option was added to the tool due to a customer request, but if you are not certain whether your hardware supports it, I recommend leaving it as-is. The chipset or HBA should enable it itself when it is supported and compatible. I have had a few people report issues around this internally because they enabled it on a system that was unable to wake the phy back up.
There are a few SATA capability bits that are not part of the humanized -i output today that might provide a few more clues.
Can you share the output of openSeaChest_PowerControl -d <handle> -i -v4 | tee verboseIdentify.txt?
This will have the raw data from the drive and I can review it manually to see if that gives some other details that may be useful.
Thanks for the response @vonericsen -- I have actually gotten myself into that situation before and can confirm it's not a good idea :)
Here is the output of that command for an example disk that was at the top of the stack on the last set of resets.
EDIT: Most of this is still accurate, but the power transitions reported by smartd (0x81->0xFF) are expected because they're on WD disks that have EPC enabled.
I'm still bashing on this. I've been trying to reduce things down to a reliable repro and I'm not quite there, but let me explain my test setup.
The server has a zpool made up of many vdevs from different vendors; one of the Seagate vdevs is made up of 11x 16TB Exos disks and another is made up of 11x 24TB Exos disks. Both of these sets live on the same backplane, which is attached to a SAS3008-based controller that's built-in on the motherboard, the Supermicro H12SSL-CT. The HBA is functionally the same as a 9300-8i and shares the same firmware image.
I am creating a continuous synthetic load by copying a 100GB random file from a scratch disk into a test dataset (and then deleting it). To be sure the disks are always busy I have two of these running in a loop.
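Roughly, each writer loop looks like this (paths are illustrative; two of these run in parallel):

# synthetic write load: copy a 100GB random file into the test dataset, delete it, repeat
while true; do
    cp /scratch/random-100G.bin /tank/testds/stress-$$.bin
    rm /tank/testds/stress-$$.bin
done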
There are a few monitoring processes on this server:
- smartd (part of smartmontools)
- netdata
- hddtemp
- storcli64 (part of LSI/Broadcom management tools for their HBAs -- used to check HBA temp)
I decided to start disabling these one-by-one to reduce things talking to the disks. First I disabled my storcli64 script because I've had issues with it in the past. But the resets continued at roughly the same clip. So next I tried disabling smartd. Right away when I disabled smartd the frequency of the resets went down dramatically. Before I disabled smartd, resets would occur fairly reliably under load (but not on a reliable schedule). But after disabling smartd it took over 12 hours of hard writes before it occurred again. When I restart smartd, the frequency increases again.
Here is my smartd config line, for reference:
DEVICESCAN -H -f -l error -l selftest -n standby,q -m [email protected] -M exec /usr/share/smartmontools/smartd-runner -M diminishing
Note that I don't do regular short/long SMART tests with smartd; it only tracks the health status and error logs. I actually used to run them, but the automated tests would reliably cause disk resets on the Seagate disks and I never came up with a solution besides disabling the automated tests. In those cases it affected individual disks, not the whole controller. I think when a SMART test is under way, some commands can hang longer than the kernel likes, which causes the kernel to reset the disk on the HBA (a default behavior of mpt3sas).
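(For clarity, the kernel-side timeout I mean is the per-device SCSI command timer, which can be inspected or bumped through sysfs; the 60 below is just an example value, not something I'm running in production:)

# current SCSI command timeout in seconds for one disk (Linux default is 30)
cat /sys/block/sdb/device/timeout
# temporarily raise it as an experiment (example value; not persistent across reboots)
echo 60 > /sys/block/sdb/device/timeout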
So then I tried running smartd in the foreground in debug mode to see if anything strange was happening. Although nothing stuck out immediately, I was surprised to see the occasional note that a drive's power status transitioned when queried. Looking back in journalctl I see these quite frequently since installing the X24 disks. Here are some examples:
Oct 29 15:38:38 dwight smartd[7299]: Device: /dev/sdao [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Oct 29 15:38:43 dwight smartd[7299]: Device: /dev/sdau [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Oct 29 15:38:53 dwight smartd[7299]: Device: /dev/sday [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Oct 29 16:08:43 dwight smartd[7299]: Device: /dev/sday [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Oct 29 18:08:43 dwight smartd[7299]: Device: /dev/sdax [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Oct 29 18:08:48 dwight smartd[7299]: Device: /dev/sday [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
Reading the smartctl source code, this line prints the old and new power states reported by the disk when smartd sends its query. I didn't know what the 0x81 state was, but I looked it up in the ATA spec (page 344, table 204) and it says that's EPC Idle_A. Now that's weird, because EPC is disabled on these disks, and I can confirm it with SeaChest across all of them. Example:
==========================================================================================
openSeaChest_PowerControl - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_PowerControl Version: 3.0.2-2_2_3 X86_64
Build Date: Jun 21 2021
Today: Thu Dec 5 04:42:11 2024 User: root
==========================================================================================
/dev/sg18 - ST24000NM000C-3WD103 - ZXA0BKFL - ATA
.
===EPC Settings===
* = timer is enabled
C column = Changeable
S column = Savable
All times are in 100 milliseconds
Name Current Timer Default Timer Saved Timer Recovery Time C S
Idle A 0 *1 *1 1 Y Y
Idle B 0 1200 1200 4 Y Y
Idle C 0 6000 6000 20 Y Y
Standby Z 0 9000 9000 110 Y Y
So, why do these sometimes end up in Idle A? I'm not sure. I don't know if this is directly related, either, but I don't have any other logs that show the power state transitions except for smartd, which happens to show them when it does a check on all the disks (which, by default, is every 20 minutes).
Is it possible something in the firmware in the X24 disks is causing power transitions even when EPC is disabled/the timers are all set to 0?
And, why would the querying of SMART data so greatly increase the frequency of these HBA resets, and only on Seagate disks? Again if I disable smartd, the frequency drops dramatically. My theory is that other processes try querying SMART (netdata, hddtemp, maybe others) but they do so less frequently.
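To take smartd's reporting out of the equation, I can also poll the drive's reported power state directly; assuming I have the option name right from openSeaChest_PowerControl --help, it would be something like:

# ask the drive directly which power state it reports (handle illustrative)
openSeaChest_PowerControl -d /dev/sg18 --checkPowerMode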
I will keep digging and hopefully this is useful.
(I'd edit my post above but I think most of you guys are reading via email and you might not see it)
Apologies, late-night jetlag brain here: almost all of the reported power transitions were actually on WD disks. There was a single Seagate X16 showing a transition, and it did in fact have a timer set. I'm not sure how that happened, and I've disabled it again now, but the resets continue regardless.
So all I can say from the above is that disabling smartd, which queries the disks roughly every 20 minutes, greatly reduces the frequency of resets. I suspect that if I could whack-a-mole every process that checks SMART data I could eliminate them entirely. But I don't really get why.
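Rather than guessing at which processes poke the drives, I might trace it directly. A rough bpftrace one-liner (0x2285 is the SG_IO ioctl number; leave it running and see what shows up around a reset):
bpftrace -e 'tracepoint:syscalls:sys_enter_ioctl /args->cmd == 0x2285/ { time("%H:%M:%S "); printf("%s pid=%d\n", comm, pid); }'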
Hi @putnam,
This is really interesting information!
And, why would the querying of SMART data so greatly increase the frequency of these HBA resets, and only on Seagate disks? Again if I disable smartd, the frequency drops dramatically. My theory is that other processes try querying SMART (netdata, hddtemp, maybe others) but they do so less frequently.
Do you know what data is being pulled each time smartd runs? Is it equivalent to the smartctl options -a or -x?
One thing I have observed in the past about resets: in every operating system, software talking to a drive must provide a timeout value, i.e. how long it expects a command to take before it is considered a failure. openSeaChest usually uses 15 seconds for most commands. When a command takes longer than that, the OS returns a command timeout error within about a second of the timeout value. On returning from the timeout, the OS also has to perform some error recovery so the drive is not left hung for the next process that accesses it, and that recovery ends up being a reset (a COMRESET on SATA).
With that in mind, a couple of things could be happening here:
- The drive has spun down and in the process of spinning back up it's taking longer than the command timeout value used. (Unlikely since you have essentially disabled EPC)
- smartd is using a command timeout value that is too short (not sure how likely, have not looked at the code but I would be surprised if it's less than the 15 seconds we use in openSeaChest)
- The drive is responding to smartd/smartctl but some piece of data that it wants is missing for some reason (log not supported, or the first few commands read from flash but one reading from disk is taking longer than expected).
openSeaChest does not have an equivalent to smartctl's -a or -x options to do a lot of things all at once, but you can add many options together to get somewhat close:
openSeaChest_SMART -d <handle> --smartAttributes hybrid --showDSTLog --showSMARTErrorLog comprehensive --smartCheck --smartInfo --deviceStatistics
This is close, but not exactly the same. However, I would be curious if running this triggers anything similar to what you are seeing with smartd. As I mentioned above in point 3, some data is stored in flash and some is stored on the disc so maybe something about this is causing the issue, but I am not certain.
One other difference I know about: openSeaChest is coded to prefer the GPL logs over the SMART logs for DST info and the SMART error log. If I remember correctly, that is not how smartctl works (though maybe this has changed over time). If smartctl is still reading the SMART logs rather than the GPL logs, that may be another part of what triggers this. I do not have an option to force openSeaChest down the SMART log path, but I can look into it to see if it helps with debugging.
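If you want to see which path smartctl is actually taking on your drives, I believe its -r debug option will dump the raw ATA commands it issues, so you can tell READ LOG EXT (GPL) apart from SMART READ LOG in the trace. Roughly:
smartctl -r ataioctl,2 -x /dev/sdX > smartctl-trace.txt 2>&1
(/dev/sdX being whichever drive you want to look at.)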
If you get a chance, can you share the output of openSeaChest_Logs -d <handle> --listSupportedLogs? This may help me understand if that is also a source of difference in how smartctl is running based on what logs the drive is supporting.
Note that I don't run regular short/long SMART tests with smartd; it only tracks the health status and error logs. I used to run them, but the automated tests would reliably cause disk resets on the Seagate disks and I never found a solution other than disabling the automated tests.
I know of a bug in the Windows version of smartctl running DST in captive/foreground mode where the timeout value is too short, which always ends up in a reset from the system. I do not remember if this also affected Linux; I found it months to a year ago. In background/offline mode it should not be an issue, since those commands return to the host immediately after starting and the drive can continue processing other commands while the test runs.
Our default in openSeaChest is to run in background/offline mode since it tends to be the most compatible, but you can force it with --captive.
openSeaChest_SMART -d <handle> --shortDST --captive
We are setting the timeout for short DST in captive to 2 minutes as the spec requires and to the drive's time estimate for long (I do not recommend running long in captive mode...that will probably never complete without a reset since it can take so many hours). You can also try this to see if it triggers the same kind of reset scenario you are seeing.
Hello @putnam ,
Did you finally get rid of this issue?
I ended up here with the exact same issue, but on different hardware. Even though this is Seagate's territory (I haven't looked carefully at this openSeaChest tool yet), I'd like to share my results.
I'm running 10x WD HC560 20TB drives with an LSI 9500-16i HBA card. The card firmware and driver have been upgraded to the latest as of this post.
Driver version:
$ modinfo mpt3sas
filename: /lib/modules/6.8.12-9-pve/updates/dkms/mpt3sas.ko
alias: mpt2sas
version: 53.00.00.00
license: GPL
description: LSI MPT Fusion SAS 3.0 & SAS 3.5 Device Driver
author: Broadcom Inc. <[email protected]>
srcversion: 50A3619DF45F7A45FF3034B
$ storcli64 /c0 show all
CLI Version = 007.3103.0000.0000 Aug 22, 2024
Operating system = Linux 6.8.12-9-pve
Controller = 0
Status = Success
Description = None
Basics :
======
Controller = 0
Adapter Type = SAS3816(A0)
Model = HBA 9500-16i
Serial Number = SKB5091416
Current System Date/time = 03/26/2025 22:53:32
Concurrent commands supported = 4352
SAS Address = 500062b20917a0c0
PCI Address = 00:3c:00:00
Version :
=======
Firmware Package Build = 34.00.00.00
Firmware Version = 34.00.00.00
Bios Version = 09.67.00.00_34.00.00.00
NVDATA Version = 32.02.00.11
PSOC FW Version = 0x006E
PSOC Part Number = 14790
Driver Name = mpt3sas
Driver Version = 53.00.00.00
And the error:
[Mar 26 21:23:31] mpt3sas_cm0: mpt3sas_ctl_reset_handler: Releasing the trace buffer due to adapter reset.
[Mar 26 21:23:31] mpt3sas_cm0: fault info from func: mpt3sas_base_make_ioc_ready
[Mar 26 21:23:31] mpt3sas_cm0: fault_state(0x5854)!
[Mar 26 21:23:31] mpt3sas_cm0: sending diag reset !!
[Mar 26 21:23:31] mpt3sas_cm0: diag reset: SUCCESS
[Mar 26 21:23:33] mpt3sas_cm0: IOC Number : 0
[Mar 26 21:23:33] mpt3sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
[Mar 26 21:23:33] mpt3sas_cm0: FW Package Version(34.00.00.00)
[Mar 26 21:23:33] mpt3sas_cm0: SAS3816: FWVersion(34.00.00.00), ChipRevision(0x00)
[Mar 26 21:23:33] mpt3sas_cm0: Protocol=(Initiator,Target,NVMe), Capabilities=(TLR,EEDP,Diag Trace Buffer,Task Set Full,NCQ)
[Mar 26 21:23:33] mpt3sas_cm0: Enable interrupt coalescing only for first 8 reply queues
[Mar 26 21:23:33] mpt3sas_cm0: performance mode: balanced
[Mar 26 21:23:33] mpt3sas_cm0: sending port enable !!
[Mar 26 21:23:54] mpt3sas_cm0: port enable: SUCCESS
[Mar 26 21:23:54] mpt3sas_cm0: search for end-devices: start
[Mar 26 21:23:54] scsi target2:0:14: handle(0x0011), sas_address(0x300162b20917a0c0), port: 0
[Mar 26 21:23:54] scsi target2:0:14: enclosure logical id(0x300062b20911a0c0), slot(16)
[Mar 26 21:23:54] scsi target2:0:1: handle(0x002f), sas_address(0x300062b20917a0c1), port: 0
[Mar 26 21:23:54] scsi target2:0:1: enclosure logical id(0x300062b20911a0c0), slot(1)
[Mar 26 21:23:54] scsi target2:0:7: handle(0x0030), sas_address(0x300062b20917a0c8), port: 8
[Mar 26 21:23:54] scsi target2:0:7: enclosure logical id(0x300062b20911a0c0), slot(8)
[Mar 26 21:23:54] scsi target2:0:8: handle(0x0037), sas_address(0x300062b20917a0c9), port: 9
[Mar 26 21:23:54] scsi target2:0:8: enclosure logical id(0x300062b20911a0c0), slot(9)
[Mar 26 21:23:54] handle changed from(0x0037)!!!
[Mar 26 21:23:54] scsi target2:0:9: handle(0x0038), sas_address(0x300062b20917a0ca), port: 10
[Mar 26 21:23:54] scsi target2:0:9: enclosure logical id(0x300062b20911a0c0), slot(10)
[Mar 26 21:23:54] handle changed from(0x0038)!!!
[Mar 26 21:23:54] scsi target2:0:11: handle(0x003a), sas_address(0x300062b20917a0cc), port: 11
[Mar 26 21:23:54] scsi target2:0:11: enclosure logical id(0x300062b20911a0c0), slot(12)
[Mar 26 21:23:54] handle changed from(0x003a)!!!
[Mar 26 21:23:54] scsi target2:0:10: handle(0x0039), sas_address(0x300062b20917a0cb), port: 12
[Mar 26 21:23:54] scsi target2:0:10: enclosure logical id(0x300062b20911a0c0), slot(11)
[Mar 26 21:23:54] handle changed from(0x0039)!!!
[Mar 26 21:23:54] scsi target2:0:2: handle(0x0031), sas_address(0x300062b20917a0c2), port: 1
[Mar 26 21:23:54] scsi target2:0:2: enclosure logical id(0x300062b20911a0c0), slot(2)
[Mar 26 21:23:54] handle changed from(0x0031)!!!
[Mar 26 21:23:54] scsi target2:0:12: handle(0x003c), sas_address(0x300062b20917a0cd), port: 13
[Mar 26 21:23:54] scsi target2:0:12: enclosure logical id(0x300062b20911a0c0), slot(13)
[Mar 26 21:23:54] handle changed from(0x003c)!!!
[Mar 26 21:23:54] scsi target2:0:13: handle(0x003b), sas_address(0x300062b20917a0ce), port: 14
[Mar 26 21:23:54] scsi target2:0:13: enclosure logical id(0x300062b20911a0c0), slot(14)
[Mar 26 21:23:54] handle changed from(0x003b)!!!
[Mar 26 21:23:54] scsi target2:0:0: handle(0x0032), sas_address(0x300062b20917a0c3), port: 2
[Mar 26 21:23:54] scsi target2:0:0: enclosure logical id(0x300062b20911a0c0), slot(3)
[Mar 26 21:23:54] handle changed from(0x0032)!!!
[Mar 26 21:23:54] scsi target2:0:3: handle(0x0033), sas_address(0x300062b20917a0c4), port: 3
[Mar 26 21:23:54] scsi target2:0:3: enclosure logical id(0x300062b20911a0c0), slot(4)
[Mar 26 21:23:54] handle changed from(0x0033)!!!
[Mar 26 21:23:54] scsi target2:0:4: handle(0x0034), sas_address(0x300062b20917a0c5), port: 4
[Mar 26 21:23:54] scsi target2:0:4: enclosure logical id(0x300062b20911a0c0), slot(5)
[Mar 26 21:23:54] handle changed from(0x0034)!!!
[Mar 26 21:23:54] scsi target2:0:5: handle(0x0035), sas_address(0x300062b20917a0c6), port: 5
[Mar 26 21:23:54] scsi target2:0:5: enclosure logical id(0x300062b20911a0c0), slot(6)
[Mar 26 21:23:54] handle changed from(0x0035)!!!
[Mar 26 21:23:54] scsi target2:0:6: handle(0x0036), sas_address(0x300062b20917a0c7), port: 6
[Mar 26 21:23:54] scsi target2:0:6: enclosure logical id(0x300062b20911a0c0), slot(7)
[Mar 26 21:23:54] handle changed from(0x0036)!!!
[Mar 26 21:23:54] mpt3sas_cm0: break from _scsih_search_responding_sas_devices: ioc_status(0x0022), loginfo(0x310f0400)
[Mar 26 21:23:54] mpt3sas_cm0: search for end-devices: complete
[Mar 26 21:23:54] mpt3sas_cm0: search for end-devices: start
[Mar 26 21:23:54] mpt3sas_cm0: search for PCIe end-devices: complete
[Mar 26 21:23:54] mpt3sas_cm0: search for expanders: start
[Mar 26 21:23:54] mpt3sas_cm0: search for expanders: complete
[Mar 26 21:23:54] mpt3sas_cm0: mpt3sas_base_hard_reset_handler: SUCCESS
[Mar 26 21:23:54] mpt3sas_cm0: _base_fault_reset_work: hard reset: success
[Mar 26 21:23:54] sd 2:0:11:0: Power-on or device reset occurred
[Mar 26 21:23:54] sd 2:0:0:0: Power-on or device reset occurred
[Mar 26 21:23:54] sd 2:0:3:0: Power-on or device reset occurred
[Mar 26 21:23:54] sd 2:0:4:0: Power-on or device reset occurred
[Mar 26 21:23:54] sd 2:0:5:0: Power-on or device reset occurred
[Mar 26 21:23:54] sd 2:0:8:0: Power-on or device reset occurred
[Mar 26 21:23:54] sd 2:0:9:0: Power-on or device reset occurred
[Mar 26 21:23:54] sd 2:0:10:0: Power-on or device reset occurred
[Mar 26 21:23:54] sd 2:0:7:0: Power-on or device reset occurred
[Mar 26 21:23:54] sd 2:0:6:0: Power-on or device reset occurred
[Mar 26 21:23:55] mpt3sas_cm0: removing unresponding devices: start
[Mar 26 21:23:55] mpt3sas_cm0: removing unresponding devices: sas end-devices
[Mar 26 21:23:55] mpt3sas_cm0: removing unresponding devices: pcie end-devices
[Mar 26 21:23:55] mpt3sas_cm0: removing unresponding devices: expanders
[Mar 26 21:23:55] mpt3sas_cm0: removing unresponding devices: complete
[Mar 26 21:23:55] mpt3sas_cm0: Update Devices with FW Reported QD
[Mar 26 21:23:55] sd 2:0:0:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:1:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:2:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:3:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:4:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:5:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:6:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:7:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:8:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:9:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:10:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:11:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:12:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] sd 2:0:13:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[Mar 26 21:23:55] ses 2:0:14:0: qdepth(1), tagged(0), scsi_level(8), cmd_que(0)
[Mar 26 21:23:55] mpt3sas_cm0: scan devices: start
[Mar 26 21:23:55] mpt3sas_cm0: scan devices: expanders start
[Mar 26 21:23:55] mpt3sas_cm0: break from expander scan: ioc_status(0x0022), loginfo(0x310f0400)
[Mar 26 21:23:55] mpt3sas_cm0: scan devices: expanders complete
[Mar 26 21:23:55] mpt3sas_cm0: scan devices: sas end devices start
[Mar 26 21:23:55] mpt3sas_cm0: break from sas end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[Mar 26 21:23:55] mpt3sas_cm0: scan devices: sas end devices complete
[Mar 26 21:23:55] mpt3sas_cm0: scan devices: pcie end devices start
[Mar 26 21:23:55] mpt3sas_cm0: break from pcie end device scan: ioc_status(0x0022), loginfo(0x310f0400)
[Mar 26 21:23:55] mpt3sas_cm0: pcie devices: pcie end devices complete
[Mar 26 21:23:55] mpt3sas_cm0: scan devices: complete
I first hit this issue on the stock driver (48.00.00.00) and stock firmware (14.00.00.00, which is ancient...) during a bulk zfs send/receive job, then upgraded to what I posted above; the issue is still there. Running an fio test can also trigger it. It seems to require extreme load, though, because it never happens during normal operation.
I can confirm that disabling smartd reduces the frequency of the resets, but they still eventually occur. And after several resets my machine completely blocks all I/O and enters a frozen state; only a hard reset brings it back.
Regards
Hello @putnam and @CkovMk,
I have been continuing to think about what else to look at, and a contact suggested checking the power to the drives, as discussed in this forum thread: https://forums.servethehome.com/index.php?threads/solved-reset-issues-causing-dropping-drives.44519/ The summary my contact gave me: when drive bays share the same power line, the issue is the voltage drop across the drives. A 5.1V supply may provide something like 4.94V when 8 drives are all idle but only 4.74V when they are all reading. Since the tolerance on the 5V rail is +/-5% (a 4.75V floor), that drop can be just enough to go out of spec and cause resets, etc.
I know keeping the HBA firmware up to date helps a lot as Seagate has seen issues in some circumstances with downrev HBA firmware not working quite right, but if that is already up to date, this is the next thing to look at.
To assist with checking the voltage you can try a FARM log option I have recently added in the develop branch to openSeaChest_SMART.
It's not officially rolled out as it is still being tested; the option is --showFARM. smartmontools also has a way to show the Seagate FARM log, but I don't recall the option in smartmontools off the top of my head.
There is a field some drives support (it depends on which version of FARM made it into the firmware; newer drives and firmware typically report more) which shows the current 5V and 12V readings, as well as the min and max within the time range the measurements were taken (another field in FARM; we are making our output show it near the applicable values).
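If you want to give it a try, something like the following should pull the voltage readings out of the FARM dump; the exact field names depend on the FARM version in the firmware, so the grep pattern is only a guess:
openSeaChest_SMART -d /dev/sg18 --showFARM | grep -iE '5v|12v|volt'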
One other thing that may be helpful in tracking this down is also checking the output of --deviceStatistics for the "Number of ASR Events" to see if that is increasing.
This statistic increments each time the device performs Asynchronous Signal Recovery. The loss of signal may also trigger additional resets. One example the SATA spec gives is if the voltage on the SATA phy drops too low and the device loses synchronization, then the device will send a COMINIT to the host/HBA to recover from this error. Once that is received the HBA sends a COMRESET to return the drive (and bus) to a known state once more. It is possible that this may also be what is happening, so check and see if this statistic is incrementing when you are seeing these issues.
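A quick way to watch that counter (the device handle is just an example) is something like:
openSeaChest_SMART -d /dev/sg18 --deviceStatistics | grep -i 'ASR Events'
Capture it before and after one of these reset events and compare.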
@putnam what are your /sys/block/<disk>/device/timeout and eh_timeout values? Also, what SCTERC timeout do you have set? It looks like you have very short timeouts set from your log (outstanding for 2948 ms & timeout 1000 ms).
SCTERC timeouts (smartctl -l scterc) should be shorter than the Linux block-layer timeout, but still long enough to handle heavy writes; see the example after the links below.
- https://wiki.tnonline.net/w/Linux/SCT_Error_Timeout
- https://git.tnonline.net/Forza/misc/src/branch/main/scsi-timeout/scterc.sh
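As an example (sdX is a placeholder), this is roughly how to check the current values and set the ERC timers; smartctl takes SCTERC values in 100ms units, so 70 means 7.0 seconds:
cat /sys/block/sdX/device/timeout      # block layer command timeout, usually 30s
cat /sys/block/sdX/device/eh_timeout   # error handler timeout, usually 10s
smartctl -l scterc /dev/sdX            # show the drive's current SCTERC settings
smartctl -l scterc,70,70 /dev/sdX      # set read/write ERC to 7.0s (usually not persistent across power cycles)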
Sorry for the silence on this ticket. I have some progress to share. I put in a spare SAS3008 PCIe HBA (a 9300-8i) and have been running the same load, now for months, without this reproducing. As a result it fell to the back of my mind and I mentally moved on, but I should have updated here.
Both the on-board HBA and the card are on the same firmware (16.00.12.00). I used the same cables that were plugged into the motherboard, just swapped them to the HBA.
So my only guess is that there's something wrong with the HBA on my motherboard, the Supermicro H12SSL-CT. It is a little weird that it never seemed to repro on the WD/HGST drives, though. There may be some intersection where whatever is wrong with the on-board HBA only repros with Seagate firmware. I don't know.
The thing I have not changed though is the smartctl testing; data is being collected, but I still have the daily short tests disabled.
For @vonericsen
Here's my smartmontools line that hasn't changed this whole time since I disabled self tests:
DEVICESCAN -H -f -l error -l selftest -n standby,q -m [email protected] -M exec /usr/share/smartmontools/smartd-runner -M diminishing
Here are the supported logs on each type of disk.
X16 Log Types
==========================================================================================
openSeaChest_Logs - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Logs Version: 2.0.1-2_2_3 X86_64
Build Date: Jun 21 2021
Today: Sun Oct 5 03:09:38 2025 User: root
==========================================================================================
- ST16000NM000J-2TW103 - xxxxxxx- ATA
Access Types:
------------
SL - SMART Log
GPL - General Purpose Log
Log Address : # of Pages : Size (Bytes) : Access
---------------:----------------:---------------:-------------
0 (00h) : 1 : 512 : SL, GPL
1 (01h) : 1 : 512 : SL
2 (02h) : 5 : 2560 : SL
3 (03h) : 5 : 2560 : GPL
4 (04h) : 8 : 4096 : SL
6 (06h) : 1 : 512 : SL
7 (07h) : 1 : 512 : GPL
8 (08h) : 2 : 1024 : GPL
9 (09h) : 1 : 512 : SL
10 (0Ah) : 8 : 4096 : GPL
16 (10h) : 1 : 512 : GPL
17 (11h) : 1 : 512 : GPL
19 (13h) : 1 : 512 : GPL
33 (21h) : 1 : 512 : GPL
34 (22h) : 1 : 512 : GPL
47 (2Fh) : 1 : 512 : GPL
48 (30h) : 9 : 4608 : SL, GPL
------------------
HOST SPECIFIC LOGS
------------------
128 (80h) : 16 : 8192 : SL, GPL
129 (81h) : 16 : 8192 : SL, GPL
130 (82h) : 16 : 8192 : SL, GPL
131 (83h) : 16 : 8192 : SL, GPL
132 (84h) : 16 : 8192 : SL, GPL
133 (85h) : 16 : 8192 : SL, GPL
134 (86h) : 16 : 8192 : SL, GPL
135 (87h) : 16 : 8192 : SL, GPL
136 (88h) : 16 : 8192 : SL, GPL
137 (89h) : 16 : 8192 : SL, GPL
138 (8Ah) : 16 : 8192 : SL, GPL
139 (8Bh) : 16 : 8192 : SL, GPL
140 (8Ch) : 16 : 8192 : SL, GPL
141 (8Dh) : 16 : 8192 : SL, GPL
142 (8Eh) : 16 : 8192 : SL, GPL
143 (8Fh) : 16 : 8192 : SL, GPL
144 (90h) : 16 : 8192 : SL, GPL
145 (91h) : 16 : 8192 : SL, GPL
146 (92h) : 16 : 8192 : SL, GPL
147 (93h) : 16 : 8192 : SL, GPL
148 (94h) : 16 : 8192 : SL, GPL
149 (95h) : 16 : 8192 : SL, GPL
150 (96h) : 16 : 8192 : SL, GPL
151 (97h) : 16 : 8192 : SL, GPL
152 (98h) : 16 : 8192 : SL, GPL
153 (99h) : 16 : 8192 : SL, GPL
154 (9Ah) : 16 : 8192 : SL, GPL
155 (9Bh) : 16 : 8192 : SL, GPL
156 (9Ch) : 16 : 8192 : SL, GPL
157 (9Dh) : 16 : 8192 : SL, GPL
158 (9Eh) : 16 : 8192 : SL, GPL
159 (9Fh) : 16 : 8192 : SL, GPL
------------------
DEVICE VENDOR SPECIFIC LOGS
------------------
161 (A1h) : 32 : 16384 : SL, GPL
162 (A2h) : 64 : 32768 : GPL
164 (A4h) : 32 : 16384 : SL, GPL
166 (A6h) : 64 : 32768 : GPL
168 (A8h) : 8 : 4096 : SL, GPL
169 (A9h) : 8 : 4096 : SL, GPL
171 (ABh) : 1 : 512 : GPL
173 (ADh) : 16 : 8192 : GPL
177 (B1h) : 32 : 16384 : SL, GPL
190 (BEh) : 127 : 65024 : GPL
191 (BFh) : 127 : 65024 : GPL
193 (C1h) : 8 : 4096 : SL, GPL
195 (C3h) : 24 : 12288 : SL, GPL
198 (C6h) : 64 : 32768 : GPL
199 (C7h) : 8 : 4096 : SL, GPL
201 (C9h) : 8 : 4096 : SL, GPL
202 (CAh) : 16 : 8192 : SL, GPL
205 (CDh) : 1 : 512 : SL, GPL
206 (CEh) : 1 : 512 : GPL
209 (D1h) : 16 : 8192 : GPL
210 (D2h) : 16 : 8192 : GPL
218 (DAh) : 1 : 512 : SL, GPL
------------------
224 (E0h) : 1 : 512 : SL, GPL
225 (E1h) : 1 : 512 : SL, GPL
X24 Log Types
==========================================================================================
openSeaChest_Logs - openSeaChest drive utilities - NVMe Enabled
Copyright (c) 2014-2021 Seagate Technology LLC and/or its Affiliates, All Rights Reserved
openSeaChest_Logs Version: 2.0.1-2_2_3 X86_64
Build Date: Jun 21 2021
Today: Sun Oct 5 03:12:05 2025 User: root
==========================================================================================
- ST24000NM000C-3WD103 - xxxxxxxx- ATA
Access Types:
------------
SL - SMART Log
GPL - General Purpose Log
Log Address : # of Pages : Size (Bytes) : Access
---------------:----------------:---------------:-------------
0 (00h) : 1 : 512 : SL, GPL
1 (01h) : 1 : 512 : SL
2 (02h) : 5 : 2560 : SL
3 (03h) : 5 : 2560 : GPL
4 (04h) : 8 : 4096 : SL
6 (06h) : 1 : 512 : SL
7 (07h) : 1 : 512 : GPL
8 (08h) : 2 : 1024 : GPL
9 (09h) : 1 : 512 : SL
10 (0Ah) : 8 : 4096 : GPL
15 (0Fh) : 2 : 1024 : GPL
16 (10h) : 1 : 512 : GPL
17 (11h) : 1 : 512 : GPL
19 (13h) : 1 : 512 : GPL
33 (21h) : 1 : 512 : GPL
34 (22h) : 1 : 512 : GPL
47 (2Fh) : 1 : 512 : GPL
48 (30h) : 9 : 4608 : SL, GPL
------------------
HOST SPECIFIC LOGS
------------------
128 (80h) : 16 : 8192 : SL, GPL
129 (81h) : 16 : 8192 : SL, GPL
130 (82h) : 16 : 8192 : SL, GPL
131 (83h) : 16 : 8192 : SL, GPL
132 (84h) : 16 : 8192 : SL, GPL
133 (85h) : 16 : 8192 : SL, GPL
134 (86h) : 16 : 8192 : SL, GPL
135 (87h) : 16 : 8192 : SL, GPL
136 (88h) : 16 : 8192 : SL, GPL
137 (89h) : 16 : 8192 : SL, GPL
138 (8Ah) : 16 : 8192 : SL, GPL
139 (8Bh) : 16 : 8192 : SL, GPL
140 (8Ch) : 16 : 8192 : SL, GPL
141 (8Dh) : 16 : 8192 : SL, GPL
142 (8Eh) : 16 : 8192 : SL, GPL
143 (8Fh) : 16 : 8192 : SL, GPL
144 (90h) : 16 : 8192 : SL, GPL
145 (91h) : 16 : 8192 : SL, GPL
146 (92h) : 16 : 8192 : SL, GPL
147 (93h) : 16 : 8192 : SL, GPL
148 (94h) : 16 : 8192 : SL, GPL
149 (95h) : 16 : 8192 : SL, GPL
150 (96h) : 16 : 8192 : SL, GPL
151 (97h) : 16 : 8192 : SL, GPL
152 (98h) : 16 : 8192 : SL, GPL
153 (99h) : 16 : 8192 : SL, GPL
154 (9Ah) : 16 : 8192 : SL, GPL
155 (9Bh) : 16 : 8192 : SL, GPL
156 (9Ch) : 16 : 8192 : SL, GPL
157 (9Dh) : 16 : 8192 : SL, GPL
158 (9Eh) : 16 : 8192 : SL, GPL
159 (9Fh) : 16 : 8192 : SL, GPL
------------------
DEVICE VENDOR SPECIFIC LOGS
------------------
161 (A1h) : 32 : 16384 : SL, GPL
162 (A2h) : 64 : 32768 : GPL
164 (A4h) : 32 : 16384 : SL, GPL
166 (A6h) : 64 : 32768 : GPL
168 (A8h) : 8 : 4096 : SL, GPL
169 (A9h) : 8 : 4096 : SL, GPL
171 (ABh) : 1 : 512 : GPL
173 (ADh) : 16 : 8192 : GPL
177 (B1h) : 32 : 16384 : SL, GPL
180 (B4h) : 16 : 8192 : SL, GPL
188 (BCh) : 1 : 512 : GPL
190 (BEh) : 127 : 65024 : GPL
191 (BFh) : 127 : 65024 : GPL
193 (C1h) : 8 : 4096 : SL, GPL
195 (C3h) : 96 : 49152 : SL, GPL
198 (C6h) : 64 : 32768 : GPL
199 (C7h) : 8 : 4096 : SL, GPL
201 (C9h) : 8 : 4096 : SL, GPL
202 (CAh) : 16 : 8192 : SL, GPL
205 (CDh) : 1 : 512 : SL, GPL
206 (CEh) : 1 : 512 : GPL
209 (D1h) : 104 : 53248 : GPL
210 (D2h) : 16 : 8192 : GPL
218 (DAh) : 1 : 512 : SL, GPL
------------------
224 (E0h) : 1 : 512 : SL, GPL
225 (E1h) : 1 : 512 : SL, GPL
Regarding power: I have a very beefy single-rail power supply in this Supermicro chassis, which is surely used in hundreds of thousands of units worldwide, and I'd imagine the backplane is also quite common. I will still check this if the issue reproduces again.
Regarding signal integrity: I'm thinking that if the on-board HBA is not actually failing, perhaps the connectors are damaged in some way. I will follow your instructions to check this should we re-use the onboard HBA again. Very useful debugging info for everyone reading, though.
For @Forza-tng
X24 disks are timeout=30, eh_timeout=10, and SCTERC is set to 70 (7.0 seconds) for both read/write. X16 disks are timeout=30, eh_timeout=10, and SCTERC is set to 100 (10.0 seconds) for both read/write.
For comparison, all of my WD disks (18TB and 20TB) have the same kernel timeouts (30/10) but SCTERC shows Disabled for both read/write.
I haven't modified any of the above settings. Is this unexpected?
Perhaps the errors are due to ASPM on the PCIe bus or port. Using an add-on card may handle this better, or avoid a problematic PCIe port/root, so you don't see the problem.
Try appending pcie_port_pm=off pcie_aspm=off to the kernel's cmdline. https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
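On a GRUB-based Debian/Proxmox install that would look roughly like this (edit /etc/default/grub, regenerate the config, reboot):
# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_port_pm=off pcie_aspm=off"
# then regenerate the bootloader config and reboot
update-grub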