snap-01 has failing memory
snap-01 has a failing DIMM throwing corrected ECC errors:
CPU_SrcID#1_MC#1_Chan#0_DIMM#0
I think this is the one marked with asterisks in this hardware linkage table:
memory stick 'P1-DIMMA1' is located at 'P0_Node0_Channel0_Dimm0'
memory stick 'P1-DIMMA2' is located at 'P0_Node0_Channel0_Dimm1'
memory stick 'P1-DIMMB1' is located at 'P0_Node0_Channel1_Dimm0'
memory stick 'P1-DIMMB2' is located at 'P0_Node0_Channel1_Dimm1'
memory stick 'P1-DIMMC1' is located at 'P0_Node0_Channel2_Dimm0'
memory stick 'P1-DIMMC2' is located at 'P0_Node0_Channel2_Dimm1'
memory stick 'P1-DIMMD1' is located at 'P0_Node1_Channel0_Dimm0'
memory stick 'P1-DIMMD2' is located at 'P0_Node1_Channel0_Dimm1'
memory stick 'P1-DIMME1' is located at 'P0_Node1_Channel1_Dimm0'
memory stick 'P1-DIMME2' is located at 'P0_Node1_Channel1_Dimm1'
memory stick 'P1-DIMMF1' is located at 'P0_Node1_Channel2_Dimm0'
memory stick 'P1-DIMMF2' is located at 'P0_Node1_Channel2_Dimm1'
memory stick 'P2-DIMMA1' is located at 'P1_Node0_Channel0_Dimm0'
memory stick 'P2-DIMMA2' is located at 'P1_Node0_Channel0_Dimm1'
memory stick 'P2-DIMMB1' is located at 'P1_Node0_Channel1_Dimm0'
memory stick 'P2-DIMMB2' is located at 'P1_Node0_Channel1_Dimm1'
memory stick 'P2-DIMMC1' is located at 'P1_Node0_Channel2_Dimm0'
memory stick 'P2-DIMMC2' is located at 'P1_Node0_Channel2_Dimm1'
memory stick 'P2-DIMMD1' is located at 'P1_Node1_Channel0_Dimm0' ****
memory stick 'P2-DIMMD2' is located at 'P1_Node1_Channel0_Dimm1'
memory stick 'P2-DIMME1' is located at 'P1_Node1_Channel1_Dimm0'
memory stick 'P2-DIMME2' is located at 'P1_Node1_Channel1_Dimm1'
memory stick 'P2-DIMMF1' is located at 'P1_Node1_Channel2_Dimm0'
memory stick 'P2-DIMMF2' is located at 'P1_Node1_Channel2_Dimm1'
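A minimal sketch of how the error label lines up with the table. It assumes (this is my reading, not confirmed by any vendor documentation) that `SrcID` maps to the `P` index, `MC` to `Node`, `Chan` to `Channel` and `DIMM` to `Dimm`. The function name and the abbreviated `LINKAGE` dictionary are hypothetical and only for illustration.

```python
import re

# Subset of the linkage table above: bank locator -> slot label on the board.
LINKAGE = {
    "P1_Node0_Channel0_Dimm0": "P2-DIMMA1",
    "P1_Node1_Channel0_Dimm0": "P2-DIMMD1",
    "P1_Node1_Channel0_Dimm1": "P2-DIMMD2",
}

LABEL_RE = re.compile(r"CPU_SrcID#(\d+)_MC#(\d+)_Chan#(\d+)_DIMM#(\d+)")


def label_to_bank_locator(label: str) -> str:
    """Translate an error label into a bank locator string.

    ASSUMPTION: SrcID -> P, MC -> Node, Chan -> Channel, DIMM -> Dimm.
    """
    match = LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognised label: {label!r}")
    src, mc, chan, dimm = match.groups()
    return f"P{src}_Node{mc}_Channel{chan}_Dimm{dimm}"


locator = label_to_bank_locator("CPU_SrcID#1_MC#1_Chan#0_DIMM#0")
print(locator, "->", LINKAGE.get(locator))
# prints: P1_Node1_Channel0_Dimm0 -> P2-DIMMD1
```

If the assumed field mapping is wrong, the lookup would point at a different slot, so the result is worth cross-checking against the DMI data before pulling a DIMM.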
DMI lists the memory as:
Handle 0x0035, DMI type 17, 84 bytes
Memory Device
Array Handle: 0x0033
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: P2-DIMMD1
Bank Locator: P1_Node1_Channel0_Dimm0
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Speed: 2666 MT/s
Manufacturer: Micron Technology
Serial Number: F0E34EF7
Asset Tag: P2-DIMMD1_AssetTag (date:20/01)
Part Number: 36ASF4G72PZ-2G6E1
Rank: 2
Configured Memory Speed: 2400 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
Memory Technology: DRAM
Memory Operating Mode Capability: Volatile memory
Firmware Version: 0000
Module Manufacturer ID: Bank 1, Hex 0x2C
Module Product ID: Unknown
Memory Subsystem Controller Manufacturer ID: Unknown
Memory Subsystem Controller Product ID: Unknown
Non-Volatile Size: None
Volatile Size: 32 GB
Cache Size: None
Logical Size: None
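For reference, a rough sketch of pulling the slot, serial and part number for a given bank locator out of DMI text like the above. The helper names are hypothetical, and the sample is trimmed to a few fields of the record shown here; on a live system the text would presumably come from `dmidecode -t 17` run as root.

```python
from typing import Optional

# Trimmed copy of the record above, with dmidecode-style tab indentation.
SAMPLE = """\
Handle 0x0035, DMI type 17, 84 bytes
Memory Device
\tLocator: P2-DIMMD1
\tBank Locator: P1_Node1_Channel0_Dimm0
\tSerial Number: F0E34EF7
\tPart Number: 36ASF4G72PZ-2G6E1
"""


def parse_memory_devices(text: str) -> list[dict[str, str]]:
    """Split DMI text into one dict of 'Key: Value' fields per handle."""
    devices: list[dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Handle "):
            current = {}
            devices.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return devices


def find_by_bank_locator(devices: list[dict[str, str]],
                         bank_locator: str) -> Optional[dict[str, str]]:
    """Return the first record whose 'Bank Locator' matches, else None."""
    for device in devices:
        if device.get("Bank Locator") == bank_locator:
            return device
    return None


dimm = find_by_bank_locator(parse_memory_devices(SAMPLE), "P1_Node1_Channel0_Dimm0")
if dimm is not None:
    print(dimm["Locator"], dimm["Serial Number"], dimm["Part Number"])
```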
I have ordered 2x replacement DIMMs. They should arrive in Catford shortly.
As soon as practical, I would like to reboot the system to ensure that ADDDC is enabled in the BIOS.
After a successful boot with ADDDC enabled, I would then like to upgrade the BIOS from revision 3.2 to the latest revision, 4.2. snap-02 has already been upgraded.
I don't want to jinx it, but it looks like the memory errors have stopped for now. Note to reader: these were corrected ECC errors, not uncorrected ECC errors.
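One way to keep an eye on corrected versus uncorrected counts is the Linux EDAC sysfs counters. This is a sketch only: it assumes the EDAC driver is loaded and exposes `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*/`, and the function name is hypothetical. It is not necessarily how the errors were monitored here.

```python
from pathlib import Path

# Standard location of the EDAC memory controller counters on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict[str, tuple[int, int]]:
    """Return {controller: (corrected, uncorrected)} for each mcN directory."""
    counts: dict[str, tuple[int, int]] = {}
    if not root.is_dir():
        # EDAC driver not loaded, or not a Linux host.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        ce_file = mc / "ce_count"
        ue_file = mc / "ue_count"
        if ce_file.is_file() and ue_file.is_file():
            counts[mc.name] = (int(ce_file.read_text()), int(ue_file.read_text()))
    return counts


if __name__ == "__main__":
    for name, (corrected, uncorrected) in read_edac_counts().items():
        print(f"{name}: corrected={corrected} uncorrected={uncorrected}")
```

The counters are cumulative since the driver was loaded, so "errors have stopped" shows up as a corrected count that no longer increases between runs.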
We scheduled a 1-hour maintenance window today, during which I performed the following:
- Rebooted into the BIOS and enabled "Enhanced PPR". PPR is "Post Package Repair"; the option enables an extended memory test during boot / POST which allows internal DDR4 re-mapping to spares. "Enhanced PPR" appears to be a Supermicro proprietary option. It extended POST by 6 minutes while running against 512GB of RAM.
- Updated BIOS to latest release. IPMI/BMC/OOB done previously.
- Enabled RAS option "ADC Sparing", ~~cannot find documentation for this~~. Maybe ADDDC mislabelled? Regardless, Xeon Scalable Silver CPUs appear not to support ADDDC. Found the documentation: "The Silver/Bronze SKUs offer Adaptive Data Correction (ADC [SR]), at Bank granularity, and the Platinum/Gold SKUs offer Adaptive Double DRAM Device Correction (ADDDC [MR]), at Bank and Rank granularity, with additional hardware facilities for device map-out."
- Ran another "Enhanced PPR" for good measure.
All options above were first tested on the twin snap-02.
We discussed the RAM replacement at the 11 July 2024 Ops call. We will aim to replace the memory in the server in the next 3 months. The server is no longer throwing errors, so this is not an urgent priority.
If the RAM starts throwing errors again, we will treat it as urgent.
2x DIMMs are in-stock @ Catford.
Unfortunately it is not possible to tell which revision is installed in snap-01. The stock is 2 different revisions.
I've been able to identify the FULL RAM model + revision: 36ASF4G72PZ-2G6E1QG from photos.
Unfortunately neither of the DIMMs I've ordered is an exact match.
Exact match: https://www.ebay.nl/itm/155164317853
I have ordered the exact memory module. It will arrive in Catford in a few days.
Matching memory module has arrived in Catford stock.
Memory ready and maintenance window scheduled for today: https://community.openstreetmap.org/t/openstreetmap-maintenance-26-september-2024/118989
Memory replaced.