mcelog Numbering of CPU and bank

/var/log/mcelog contains the following entry. The machine runs Ubuntu 16.04, which reports the mcelog version as 128+dfsg-1.

Hardware event. This is not a software error.
MCE 0
CPU 1 BANK 8 TSC 235983e523450 
MISC 2000000a6646 ADDR 93e6e4300 
TIME 1603741601 Mon Oct 26 20:46:41 2020
MCG status:
MCi status:
Corrected error
MCi_MISC register valid
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
Memory read ECC error
Memory corrected error count (CORE_ERR_CNT): 1
Memory transaction Tracker ID (RTId): 46
Memory DIMM ID of error: 2
Memory channel ID of error: 2
Memory ECC syndrome: 2000
STATUS 8c0000400001009f MCGSTATUS 0
MCGCAP 1c09 APICID 20 SOCKETID 1 
CPUID Vendor Intel Family 6 Model 44
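
For reference, the "Corrected error" and "register valid" lines above can be reproduced from the raw STATUS value. This is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout described in the Intel SDM. The names `STATUS_FLAGS` and `decode_status` are made up for illustration, and the corrected-count field (bits 52:38) is only meaningful on CPUs that report it.

```python
# Sketch: decode the architectural flag bits of an IA32_MCi_STATUS value.
# Bit positions are assumed from the Intel SDM machine-check chapter.
STATUS_FLAGS = {
    63: "VAL",    # register contains valid data
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}


def decode_status(status: int) -> dict:
    """Split a raw 64-bit STATUS value into flags and numeric fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected": "UC" not in flags,
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "model_code": (status >> 16) & 0xFFFF,       # bits 31:16, model specific
        "mca_code": status & 0xFFFF,                 # bits 15:0
    }


info = decode_status(0x8C0000400001009F)  # STATUS value from the log above
print(info["flags"], info["corrected"], info["corrected_count"], hex(info["mca_code"]))
```

For the STATUS value in the log this yields the flags VAL, MISCV and ADDRV, no UC bit (so a corrected error), and a corrected-error count of 1, which agrees with the decoded text mcelog printed.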

I have been trying to switch the DIMM in memory bank 8 at CPU 1 as labeled on the motherboard. However, that particular kind of error has been reported again at the same location (CPU 1 BANK 8). Before wildly switching DIMMs around, I am hoping that somebody might be able to tell me what kind of numbering mcelog uses, starting from zero or from one (presumably the same for all kinds of objects).

For comparison, dmidecode uses labels like PROC {1,2} DIMM {1..9} which would make the numbering from one an obvious candidate. However, I have seen examples of mcelog counting the CPUs from zero. As for using both numberings, lshw lists cpu:{0,1} in slot: Proc {1,2} and memory:{0,1} with bank:{0..8} as physical id: {0..9} in PROC {1,2} DIMM {1..9}. Finally, it could even depend on the kind of machine and how its BIOS reports to the kernel.

I have been totally unsuccessful in finding any answer to my question and I am afraid that I would not be any more successful when digging through the code. Can anybody answer this question authoritatively? Thanks in advance for your consideration!

Oct 27 '20 15:10 sm8ps

On Tue, Oct 27, 2020 at 08:59:02AM -0700, sm8ps wrote:

> I have been trying to switch the DIMM in memory bank 8 at CPU 1 as labeled on the motherboard. However, that particular kind of error has been reported again at the same location (CPU 1 BANK 8). Before wildly switching DIMMs around, I am hoping that somebody might be able to tell me what kind of numbering mcelog uses, starting from zero or from one (presumably the same for all kinds of objects).

From zero.

Actually, CPUs can be offlined. In that case there would be holes in the numbering.

> For comparison, dmidecode uses labels like PROC {1,2} DIMM {1..9} which would make the numbering from one an obvious candidate. However, I have seen examples of mcelog counting the CPUs from zero. As for using both numberings, lshw lists cpu:{0,1} in slot: Proc {1,2} and memory:{0,1} with bank:{0..9} as physical id: {0..9} in PROC {1,2} DIMM {1..9}. Finally, it could even depend on the kind of machine and how its BIOS reports to the kernel.

mcelog uses the same numbering as Linux, which in turn depends on the BIOS and the machine. Also, these CPUs are of course cores and threads, while any labels on the motherboard would refer to sockets.

The mapping from each CPU to its socket (and other topology levels) is in /sys/devices/system/cpu/cpuX/topology/

However, whether that corresponds to the motherboard labels depends on the motherboard and BIOS.

-Andi

Oct 28 '20 03:10 andikleen

Thanks so much for your answer @andikleen!

Thus the value CPU 1 in mcelog corresponds to the core listed as /sys/devices/system/cpu/cpu1/, right? Unfortunately the sub-directory topology/ contains only the files

core_id  core_siblings  core_siblings_list  physical_package_id  thread_siblings  thread_siblings_list

So I am still trying to find the right socket. The value in physical_package_id is always 1 in all the cpu#-directories corresponding to the values listed in core_siblings_list and is always 0 in all the other cpu#-directories. That seems to refer to the socket numbering, right?
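
The check described above, comparing physical_package_id across all the cpu# directories, can be scripted. Here is a rough Python sketch that groups logical CPUs by package id. The function name `cpus_by_socket` and its `root` parameter are hypothetical, and it assumes the sysfs layout quoted in this thread.

```python
from collections import defaultdict
from pathlib import Path


def cpus_by_socket(root="/sys/devices/system/cpu"):
    """Return {physical_package_id: [logical CPU numbers]} read from sysfs."""
    sockets = defaultdict(list)
    # Only cpu<N> directories match; entries such as cpufreq or cpuidle do not.
    for topo in Path(root).glob("cpu[0-9]*/topology"):
        pkg_file = topo / "physical_package_id"
        if not pkg_file.is_file():
            continue  # e.g. an offlined CPU without topology information
        cpu = int(topo.parent.name[3:])  # "cpu12" -> 12
        sockets[int(pkg_file.read_text())].append(cpu)
    return {pkg: sorted(cpus) for pkg, cpus in sorted(sockets.items())}


if __name__ == "__main__":
    for pkg, cpus in cpus_by_socket().items():
        print(f"physical_package_id {pkg}: CPUs {cpus}")
```

If cpu1 shows up under package id 1, that would be consistent with SOCKETID 1 in the log. Whether package id 1 is the socket labeled "2" on the board is still up to the motherboard and BIOS.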

All in all I would guess that it is socket nr. 1 out of (0,1) which seems to correspond to the value SOCKETID 1 in mcelog (which I had overlooked). The information for the motherboard mentions processor sockets 1 and 2 so I think it should be the second one. Does that sound right?

Next to identify the faulty DIMM module! Is the information Memory DIMM ID of error: 2 in mcelog the one I am looking for? I was at first convinced that it had to be nr. 8 as in (memory) BANK 8.
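
To see where "Memory DIMM ID of error: 2" comes from, the MISC value can be split by hand. The bit positions in this Python sketch are an assumption (RTId in bits 7:0, DIMM in 17:16, channel in 19:18, syndrome in 63:32), modeled on how mcelog appears to decode this CPU family. The only check made here is that they reproduce the numbers in the log. The function name is made up.

```python
def decode_memory_misc(misc: int) -> dict:
    """Split a memory-controller MISC value into the fields mcelog prints.

    The field positions are assumptions, not taken from vendor documentation.
    """
    return {
        "rtid": misc & 0xFF,                    # bits 7:0
        "dimm": (misc >> 16) & 0x3,             # bits 17:16
        "channel": (misc >> 18) & 0x3,          # bits 19:18
        "syndrome": (misc >> 32) & 0xFFFFFFFF,  # bits 63:32
    }


fields = decode_memory_misc(0x2000000A6646)  # MISC value from the log
print(f"RTId {fields['rtid']:x}, DIMM {fields['dimm']}, "
      f"channel {fields['channel']}, syndrome {fields['syndrome']:x}")
```

If that layout holds, the DIMM ID is a 2-bit, zero-based index that would have to be read together with the channel ID rather than as a position among all nine slots. How the channel and DIMM pair maps to the labels printed on the board is board-specific.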

So in conclusion I should replace DIMM nr. 3 (out of 1..9) connected to socket 2 (SOCKETID 1 out of 0..1), right?

Sorry for these very simple questions! This is the very first time that I have to dive into such matters and they have got me quite a bit confused. May others find the answers helpful, too!

Oct 28 '20 22:10 sm8ps