edac-utils

Errors on two servers are reported from different banks but with the same "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0" label

Open f1-outsourcing opened this issue 4 years ago • 0 comments

I am having trouble identifying which memory module is generating the errors. As far as I know, the best indicator is the line "EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5:". I do not understand why errors reported from different banks (Bank 5 on server 1, Bank 10 on server 2) both carry the same "CPU_SrcID#0_Ha#0_Chan#2_DIMM#0" label.

The memory is currently allocated, so I cannot test it on the live system. If I want to test this memory area with memtester after a reboot, would this be the correct way to do it?

memtester -p 0x00800000000 1G 1
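One thing that can be checked before rebooting is whether the requested range actually contains the address from the log. The sketch below is only arithmetic on the values quoted in this issue. It assumes that `1G` means 2^30 bytes and that the `ADDR` field from server 1 is the physical address of interest; the helper name `range_covers` is made up for illustration.

```python
# Does the physical range passed to memtester contain the logged ADDR?
# Assumption: "1G" means 1 GiB (2**30 bytes).

GIB = 2**30


def range_covers(base: int, size: int, addr: int) -> bool:
    """Return True if addr lies in the half-open range [base, base + size)."""
    return base <= addr < base + size


base = 0x00800000000   # value given to memtester -p
size = 1 * GIB         # the "1G" argument
addr = 0x849923380     # ADDR from the server 1 log

print(hex(base + size))                # end of the tested range
print(range_covers(base, size, addr))  # is the logged address inside?

# Hypothetical alternative: start at the 1 GiB boundary at or below addr.
aligned_base = addr & ~(GIB - 1)
print(hex(aligned_base))
```

With these values the range ends at 0x840000000, which is below the logged 0x849923380, so the command as written would not touch the reported address; a base of 0x840000000 would. Whether memtester can map that range at all may depend on the kernel's restrictions on /dev/mem, which this sketch does not check.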

server 1

[Mon Jan 27 19:27:15 2020] mce: [Hardware Error]: Machine check events logged
[Mon Jan 27 19:27:15 2020] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Mon Jan 27 19:27:15 2020] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: 8c00004000010092
[Mon Jan 27 19:27:15 2020] EDAC sbridge MC0: TSC 0
[Mon Jan 27 19:27:15 2020] EDAC sbridge MC0: ADDR 849923380
[Mon Jan 27 19:27:15 2020] EDAC sbridge MC0: MISC 42525286
[Mon Jan 27 19:27:15 2020] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1580149818 SOCKET 0 APIC 0
[Mon Jan 27 19:27:15 2020] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x849923 offset:0x380 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:0)
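
The `page` and `offset` fields in the last line appear to be the `ADDR` value split at a page boundary. A small sketch of that split, assuming 4 KiB pages (`split_addr` is an illustrative name, not part of edac-utils):

```python
from typing import Tuple

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def split_addr(addr: int) -> Tuple[int, int]:
    """Split a physical address into (page frame number, offset within page)."""
    return addr >> PAGE_SHIFT, addr & ((1 << PAGE_SHIFT) - 1)


page, offset = split_addr(0x849923380)  # ADDR from the log above
print(hex(page), hex(offset))
```

For the address above this gives page 0x849923 and offset 0x380, matching what EDAC printed.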

server 2


[Sun Jan 19 11:31:32 2020] mce: [Hardware Error]: Machine check events logged
[Sun Jan 19 11:31:32 2020] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Sun Jan 19 11:31:32 2020] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c000050000800c2
[Sun Jan 19 11:31:32 2020] EDAC sbridge MC0: TSC 0
[Sun Jan 19 11:31:32 2020] EDAC sbridge MC0: ADDR 6989c7000
[Sun Jan 19 11:31:32 2020] EDAC sbridge MC0: MISC 90000000000208c
[Sun Jan 19 11:31:32 2020] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1579430445 SOCKET 0 APIC 0
[Sun Jan 19 11:31:32 2020] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 page:0x6989c7 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:4 rank:255)
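
For comparing the two events, it may help to decode the status words printed after "Bank 5:" and "Bank 10:". The sketch below assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (bits 15:0 MCA error code, bits 31:16 model-specific code, bit 58 address valid, bit 61 uncorrected, bit 63 valid) and the memory-controller error code form `000F 0000 1MMM CCCC`, where MMM is the transaction type and CCCC the channel. `decode_status` and the keys of the returned dict are illustrative names.

```python
# Transaction types for the memory-controller compound error code
# (assumed from the Intel SDM encoding 000F 0000 1MMM CCCC).
MEM_OPS = {
    0: "generic",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_status(status: int) -> dict:
    """Decode a few fields of an IA32_MCi_STATUS value."""
    mca = status & 0xFFFF  # bits 15:0, MCA error code
    return {
        "valid": bool((status >> 63) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "model_code": (status >> 16) & 0xFFFF,
        "mca_code": mca,
        # Ignore the F (filtering) bit when matching the memory-controller form.
        "is_memory_error": (mca & 0xEF80) == 0x0080,
        "op": MEM_OPS.get((mca >> 4) & 0x7, "reserved"),
        "channel": mca & 0xF,
    }


for label, status in [("server 1, bank 5", 0x8C00004000010092),
                      ("server 2, bank 10", 0x8C000050000800C2)]:
    print(label, decode_status(status))
```

Under these assumptions both status words decode to corrected memory errors on channel 2 (a read on server 1, a scrub on server 2), which is consistent with the `err_code` and `channel:2` fields EDAC prints. The DIMM label appears to follow from the decoded channel and slot rather than from the bank number, although that is an inference from these logs and not something confirmed here.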

f1-outsourcing • Jan 28 '20 21:01