snap-01 has failing memory

Open Firefishy opened this issue 1 year ago • 2 comments

Snap-01 has a failing DIMM throwing ECC Correction errors.

CPU_SrcID#1_MC#1_Chan#0_DIMM#0

I think this is the slot marked with **** in this hardware linkage table:

memory stick 'P1-DIMMA1' is located at 'P0_Node0_Channel0_Dimm0'
memory stick 'P1-DIMMA2' is located at 'P0_Node0_Channel0_Dimm1'
memory stick 'P1-DIMMB1' is located at 'P0_Node0_Channel1_Dimm0'
memory stick 'P1-DIMMB2' is located at 'P0_Node0_Channel1_Dimm1'
memory stick 'P1-DIMMC1' is located at 'P0_Node0_Channel2_Dimm0'
memory stick 'P1-DIMMC2' is located at 'P0_Node0_Channel2_Dimm1'

memory stick 'P1-DIMMD1' is located at 'P0_Node1_Channel0_Dimm0'
memory stick 'P1-DIMMD2' is located at 'P0_Node1_Channel0_Dimm1'
memory stick 'P1-DIMME1' is located at 'P0_Node1_Channel1_Dimm0'
memory stick 'P1-DIMME2' is located at 'P0_Node1_Channel1_Dimm1'
memory stick 'P1-DIMMF1' is located at 'P0_Node1_Channel2_Dimm0'
memory stick 'P1-DIMMF2' is located at 'P0_Node1_Channel2_Dimm1'

memory stick 'P2-DIMMA1' is located at 'P1_Node0_Channel0_Dimm0'
memory stick 'P2-DIMMA2' is located at 'P1_Node0_Channel0_Dimm1'
memory stick 'P2-DIMMB1' is located at 'P1_Node0_Channel1_Dimm0'
memory stick 'P2-DIMMB2' is located at 'P1_Node0_Channel1_Dimm1'
memory stick 'P2-DIMMC1' is located at 'P1_Node0_Channel2_Dimm0'
memory stick 'P2-DIMMC2' is located at 'P1_Node0_Channel2_Dimm1'

memory stick 'P2-DIMMD1' is located at 'P1_Node1_Channel0_Dimm0' ****
memory stick 'P2-DIMMD2' is located at 'P1_Node1_Channel0_Dimm1'
memory stick 'P2-DIMME1' is located at 'P1_Node1_Channel1_Dimm0'
memory stick 'P2-DIMME2' is located at 'P1_Node1_Channel1_Dimm1'
memory stick 'P2-DIMMF1' is located at 'P1_Node1_Channel2_Dimm0'
memory stick 'P2-DIMMF2' is located at 'P1_Node1_Channel2_Dimm1'
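The lookup above can be sketched in code. This is only an illustration of the naming pattern visible in the table, not a vendor-documented mapping: it assumes `SrcID` corresponds to the `P` index, `MC` to `Node`, `Chan` to `Channel`, and that slot letters A-F run across the two nodes in groups of three channels. The helper name `edac_to_locators` is hypothetical.

```python
import re

# Assumed shape of the EDAC-style label quoted at the top of this issue.
EDAC_LABEL = re.compile(r"CPU_SrcID#(\d+)_MC#(\d+)_Chan#(\d+)_DIMM#(\d+)")

NODES_PER_CPU = 2      # Node0..Node1 in the table
CHANNELS_PER_NODE = 3  # Channel0..Channel2 in the table
SLOT_LETTERS = "ABCDEF"


def edac_to_locators(label: str) -> tuple[str, str]:
    """Return (bank_locator, slot_label) for an EDAC label (hypothetical helper)."""
    match = EDAC_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognised EDAC label: {label!r}")
    src, mc, chan, dimm = (int(group) for group in match.groups())
    if mc >= NODES_PER_CPU or chan >= CHANNELS_PER_NODE:
        raise ValueError(f"label outside the table's range: {label!r}")
    # Bank locator as written in the table, e.g. P1_Node1_Channel0_Dimm0.
    bank_locator = f"P{src}_Node{mc}_Channel{chan}_Dimm{dimm}"
    # Slot label as written in the table, e.g. P2-DIMMD1 (1-based CPU and DIMM).
    letter = SLOT_LETTERS[mc * CHANNELS_PER_NODE + chan]
    slot_label = f"P{src + 1}-DIMM{letter}{dimm + 1}"
    return bank_locator, slot_label


print(edac_to_locators("CPU_SrcID#1_MC#1_Chan#0_DIMM#0"))
# -> ('P1_Node1_Channel0_Dimm0', 'P2-DIMMD1')
```

The result should still be checked against the DMI data before pulling a DIMM, since the mapping is inferred from this one board's table.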

DMI lists the memory as:

Handle 0x0035, DMI type 17, 84 bytes
Memory Device
        Array Handle: 0x0033
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 32 GB
        Form Factor: DIMM
        Set: None
        Locator: P2-DIMMD1
        Bank Locator: P1_Node1_Channel0_Dimm0
        Type: DDR4
        Type Detail: Synchronous Registered (Buffered)
        Speed: 2666 MT/s
        Manufacturer: Micron Technology
        Serial Number: F0E34EF7
        Asset Tag: P2-DIMMD1_AssetTag (date:20/01)
        Part Number: 36ASF4G72PZ-2G6E1
        Rank: 2
        Configured Memory Speed: 2400 MT/s
        Minimum Voltage: 1.2 V
        Maximum Voltage: 1.2 V
        Configured Voltage: 1.2 V
        Memory Technology: DRAM
        Memory Operating Mode Capability: Volatile memory
        Firmware Version: 0000
        Module Manufacturer ID: Bank 1, Hex 0x2C
        Module Product ID: Unknown
        Memory Subsystem Controller Manufacturer ID: Unknown
        Memory Subsystem Controller Product ID: Unknown
        Non-Volatile Size: None
        Volatile Size: 32 GB
        Cache Size: None
        Logical Size: None
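For anyone repeating this lookup, here is a minimal sketch of pulling the matching record out of DMI text by its bank locator. It assumes the text was captured from something like `dmidecode --type 17` and saved as a string; the function names `parse_memory_devices` and `find_by_bank_locator` are made up for this example, and the sample is a shortened excerpt of the record above.

```python
SAMPLE = """\
Handle 0x0035, DMI type 17, 84 bytes
Memory Device
        Size: 32 GB
        Locator: P2-DIMMD1
        Bank Locator: P1_Node1_Channel0_Dimm0
        Serial Number: F0E34EF7
        Part Number: 36ASF4G72PZ-2G6E1
"""


def parse_memory_devices(dmi_text: str) -> list[dict[str, str]]:
    """Split DMI type 17 text into one dict of fields per Memory Device."""
    devices: list[dict[str, str]] = []
    current = None
    for line in dmi_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Handle "):
            current = None  # a new record starts; wait for its title line
        elif stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif current is not None and ":" in stripped:
            # Split on the first colon only, so values may contain colons.
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def find_by_bank_locator(devices, bank_locator):
    """Return the first device whose Bank Locator matches, else None."""
    return next((d for d in devices if d.get("Bank Locator") == bank_locator), None)


device = find_by_bank_locator(parse_memory_devices(SAMPLE), "P1_Node1_Channel0_Dimm0")
print(device["Locator"], device["Part Number"], device["Serial Number"])
```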

Firefishy avatar Jul 09 '24 04:07 Firefishy

I have ordered 2x replacement DIMMs. They should arrive in Catford shortly.

Firefishy avatar Jul 09 '24 04:07 Firefishy

As soon as plausible I would like to reboot the system to ensure that ADDDC is enabled in the BIOS.

On a successful boot with ADDDC enabled I would then like to upgrade the BIOS to the latest revision 3.2 -> 4.2. snap-02 has already been upgraded.

Firefishy avatar Jul 09 '24 04:07 Firefishy

I don't want to jinx it, but it looks like the memory errors have stopped for now. Note to reader: these were corrected ECC errors, not uncorrected ECC errors.
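One way to keep an eye on the corrected vs uncorrected distinction is to read the Linux EDAC counters. The sketch below assumes the kernel EDAC driver is loaded and exposes per-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc`; the function name `edac_error_counts` is hypothetical, and this is not necessarily how the errors were monitored here.

```python
from pathlib import Path

EDAC_ROOT = "/sys/devices/system/edac/mc"  # standard EDAC sysfs location


def edac_error_counts(root=EDAC_ROOT) -> dict[str, tuple[int, int]]:
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    root_path = Path(root)
    counts: dict[str, tuple[int, int]] = {}
    if not root_path.is_dir():
        return counts  # EDAC not available on this machine
    for mc_dir in sorted(root_path.glob("mc[0-9]*")):
        corrected = int((mc_dir / "ce_count").read_text().strip())
        uncorrected = int((mc_dir / "ue_count").read_text().strip())
        counts[mc_dir.name] = (corrected, uncorrected)
    return counts


for name, (corrected, uncorrected) in edac_error_counts().items():
    print(f"{name}: corrected={corrected} uncorrected={uncorrected}")
```

The counters are cumulative since the driver was loaded, so comparing two readings taken some time apart shows whether new corrected errors are still arriving.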

We scheduled a 1 hour maintenance today where I performed the following:

  • Rebooted into BIOS and enabled "Enhanced PPR". PPR stands for "Post Package Repair"; it enables an extended memory test during boot / POST which allows internal DDR4 re-mapping to spares. "Enhanced PPR" appears to be a Supermicro proprietary option. It extended POST by 6 minutes when run against 512GB of RAM.
  • Updated BIOS to latest release. IPMI/BMC/OOB done previously.
  • Enabled RAS option "ADC Sparing", ~~cannot find documentation for this~~. Maybe ADDDC mislabelled? Regardless, Xeon Scalable Silver appears not to support ADDDC. Found the documentation: "The Silver/Bronze SKUs offer Adaptive Data Correction (ADC [SR]), at Bank granularity, and the Platinum/Gold SKUs offer Adaptive Double DRAM Device Correction (ADDDC [MR]), at Bank and Rank granularity, with additional hardware facilities for device map-out."
  • Ran another "Enhanced PPR" for good measure.

All options above were first tested on the twin snap-02.

Firefishy avatar Jul 10 '24 22:07 Firefishy

We discussed the RAM replacement at the 11 July 2024 Ops call. We will aim to replace the memory in the server in the next 3 months. The server is no longer throwing errors, so this is not an urgent priority.

Firefishy avatar Jul 12 '24 10:07 Firefishy

In the event the RAM starts throwing errors we will treat it as urgent.

Firefishy avatar Jul 12 '24 10:07 Firefishy

2x DIMMs are in-stock @ Catford.

Unfortunately it is not possible to tell which revision is installed in snap-01. The stock consists of 2 different revisions.

Firefishy avatar Jul 12 '24 10:07 Firefishy

I've been able to identify the full RAM model + revision from photos: 36ASF4G72PZ-2G6E1QG. Unfortunately neither of the modules I've ordered is an exact match.

Exact match: https://www.ebay.nl/itm/155164317853

Firefishy avatar Jul 19 '24 22:07 Firefishy

I have ordered the exact memory module. It will arrive in Catford in a few days.

Firefishy avatar Aug 17 '24 09:08 Firefishy

Matching memory module has arrived in Catford stock.

Firefishy avatar Aug 20 '24 14:08 Firefishy

Memory ready and maintenance window scheduled for today: https://community.openstreetmap.org/t/openstreetmap-maintenance-26-september-2024/118989

Firefishy avatar Sep 26 '24 17:09 Firefishy

Memory replaced.

Firefishy avatar Sep 26 '24 21:09 Firefishy