Exos X16 fails to change sector size on a Supermicro server

Open danderson opened this issue 2 years ago • 13 comments

I'm doing initial setup on some new ST16000NM003G drives (Exos X16 16TB SATA). openSeaChest_Format -d /dev/sdi --showSupportedFormats says the drives support 4096b sectors, and are currently configured with 512b sectors. However, attempting to change the sector size fails with Set Sector Configuration Ext returning: ABORTED.

Hardware-wise, the drive is connected to a Supermicro SSG-5028R-E1CR12LA-CE010 server. The device chain from CPU to drive is:

  • X10SRH-CLN4F motherboard
  • Supermicro AOC-S3008L-L8e SAS3 HBA (based on LSI/BCM 3008 IC)
  • BPN-SAS3-826EL1 SAS expander backplane (based on LSISASx28 expander IC)

Searching the issue tracker, I believe I'm seeing exactly the same symptoms as https://github.com/Seagate/openSeaChest/issues/79, although possibly with slightly different hardware (X10 motherboard instead of X11, but also an LSI/BCM 3008 HBA, and also a Supermicro server, so likely a similar backplane SAS expander).

I've attached the output of openSeaChest_Info -d /dev/sdi -i, openSeaChest_Format -d /dev/sdi --showSupportedFormats, and openSeaChest_Format -d /dev/sdi --setSectorSize=4096 --confirm this-will-erase-data-and-may-render-the-drive-inoperable.

sdi-info.txt sdi-supportedformats.txt sdi-format.txt

The linked issue has a workaround (run the sector reconfiguration from a different system without all the LSI, Supermicro and SAS<>SATA stuff in the chain), so really I'm filing this issue to ask: is there any more data I could provide to give you more insight into this issue? Since I can apparently reproduce it, and I'm going to be doing destructive burn-in on these drives for a few days, I can run debug commands and make invasive drive changes without harming data.

danderson avatar Jul 11 '23 19:07 danderson

Reproducing relevant info from #79, so people don't have to go digging: in that bug the reporter had a Supermicro X11DPH-T motherboard, and the same Supermicro AOC-S3008L-L8e HBA as me. No info on the backplane in that bug, but given Supermicro's product lineup, it seems likely that it's the same expander backplane as my system, since those boards don't change much even between different server models.

danderson avatar Jul 11 '23 19:07 danderson

One more datapoint: I moved one of the drives to an older Supermicro server with a SAS2 storage chain, and I was able to change the sector size there successfully. Listing the hardware in that server too, just in case the A/B datapoints help:

  • Motherboard: Supermicro X10SLM+-LN4F
  • HBA: Broadcom / LSI 9211-8i
  • Backplane: Supermicro BPN-SAS2-826EL1 (based on LSI SAS2X28 expander IC)

This server is a franken-machine assembled from a used chassis+backplane, motherboard and HBA. This is not a configuration sold by Supermicro directly (whereas the one in my original report, afaik, is).
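For anyone repeating this A/B test, one way to check the result from the OS side is to read the block sizes the Linux kernel reports in sysfs. This is only a minimal sketch: the helper name `read_block_sizes` and the device name `sdi` are illustrative, and it assumes the standard `/sys/block/<dev>/queue/` layout. The kernel may also need a rescan or reboot before it reflects a new sector size.

```python
from pathlib import Path


def read_block_sizes(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the logical and physical block sizes the kernel reports.

    `device` is a bare block device name such as "sdi" (hypothetical here).
    `sysfs_root` is a parameter only so the function can be pointed at a
    test directory instead of the real sysfs tree.
    """
    queue = Path(sysfs_root) / device / "queue"
    return {
        "logical": int((queue / "logical_block_size").read_text().strip()),
        "physical": int((queue / "physical_block_size").read_text().strip()),
    }


# Example use on a live system (not executed here):
#   sizes = read_block_sizes("sdi")
#   print(sizes)
```

A logical size of 512 before the change and 4096 after it would match what `--showSupportedFormats` describes, but the openSeaChest output remains the authoritative view of the drive itself.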

danderson avatar Jul 11 '23 20:07 danderson

Hi @danderson, Thanks for the logs, I will take a look and see if I find something else that might help track this down. While debugging #79, I asked Seagate's engineer who works with Supermicro to test the Supermicro hardware we have and he could not repeat it. Seagate's engineer asked Supermicro's lab to also see if they could repeat this issue, but we never got it to repeat with the same hardware that was reported in that issue...so we really do not know what the issue is.

vonericsen avatar Jul 20 '23 21:07 vonericsen

Thanks for taking a look! I don't envy having to track this through all the layers to find where things are going wrong.

I filed this purely in case it provides additional clues, or if I can provide further data about the configuration that wasn't working. If that's not the case, then I'm happy to close this bug as there's only so much digging that's possible across multiple vendors like this.

danderson avatar Jul 21 '23 17:07 danderson

I reviewed the logs and I cannot figure out what would be wrong right now. Everything is being populated in the command correctly according to the specifications.

I've asked to see if someone in Seagate's firmware group can help me understand the spec's abort reason "the device is unable to complete processing of the command" to see if that can help me track it back to a feature interaction or something else in the firmware that I may be able to control. The other cases for the command abort from the spec are not the issue since the fields are all being filled in properly (unless for some reason the HBA firmware is filtering them out on the bus, but you would need a bus trace to see this).
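For readers unfamiliar with how an ATA abort is reported: the drive sets the ERR bit in the Status register and the ABRT bit in the Error register. The sketch below decodes those two registers using the conventional ATA bit positions. The function name and the sample values `0x51` / `0x04` are illustrative assumptions, not values taken from the attached logs.

```python
# Conventional ATA register bit positions (bit number -> short name).
# Only the commonly named bits are listed; other bits are ignored here.
STATUS_BITS = {7: "BSY", 6: "DRDY", 5: "DF", 3: "DRQ", 0: "ERR"}
ERROR_BITS = {7: "ICRC", 6: "UNC", 4: "IDNF", 2: "ABRT"}


def decode_register(value: int, names: dict) -> list:
    """Return the names of the set bits, highest bit first."""
    return [
        name
        for bit, name in sorted(names.items(), reverse=True)
        if value & (1 << bit)
    ]


def is_command_aborted(status: int, error: int) -> bool:
    """True when the drive flagged an error and gave ABRT as the reason."""
    return "ERR" in decode_register(status, STATUS_BITS) and \
        "ABRT" in decode_register(error, ERROR_BITS)


# Illustrative values only: a status/error pair commonly seen for aborts.
print(decode_register(0x51, STATUS_BITS))
print(decode_register(0x04, ERROR_BITS))
print(is_command_aborted(0x51, 0x04))
```

ABRT on its own does not say which of the spec's abort conditions applied, which is why the registers alone cannot distinguish a firmware-side refusal from something altered along the HBA path.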

The only other thing I can think of while I dig backwards: have you tried updating the HBA firmware? I'm not sure if it will fix it, but sometimes updating HBA firmware resolves odd things like this. In #111, updating the HBA firmware resolved a strange bug where the drive was not going into the idle or standby modes like it should. Maybe something similar is going on here, causing the drive to think it cannot do the fast format right now because of some other bus activity from the HBA.

vonericsen avatar Jul 21 '23 18:07 vonericsen