
rasdaemon does not log MCE

Open robinchrist opened this issue 1 year ago • 37 comments

Hi,

I'm using rasdaemon v0.6.8 (From Debian, https://packages.debian.org/de/bookworm/rasdaemon) on Kernel 5.15 (Proxmox 7.4, 5.15.102-1-pve) and ASRock X570D4U-2L2T + AMD Ryzen 5950X.

I do get some MCEs in the kernel log:

root@pve:~# dmesg | grep -i mce
[    0.644337] mce: [Hardware Error]: Machine check events logged
[    0.644338] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 17: dc2040000000011b
[    0.644342] mce: [Hardware Error]: TSC 0 ADDR a8eb3fc80 MISC d01202dd01000000 SYND 88e00040a800200 IPID 9600050f00 
[    0.644345] mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1682293811 SOCKET 0 APIC 0 microcode a201009
[    4.768515] MCE: In-kernel MCE decoding enabled.
[  310.396113] mce: [Hardware Error]: Machine check events logged
[  316.656894] mce: [Hardware Error]: Machine check events logged
[  627.947258] mce: [Hardware Error]: Machine check events logged
[  939.240972] mce: [Hardware Error]: Machine check events logged
[ 1250.534814] mce: [Hardware Error]: Machine check events logged
[ 1561.828702] mce: [Hardware Error]: Machine check events logged
[ 1873.122720] mce: [Hardware Error]: Machine check events logged

but ras-mc-ctl doesn't report anything:

root@pve:~# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

Everything seems to be running fine:

root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# systemctl status rasdaemon.service 
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2023-04-24 01:50:17 CEST; 32min ago
    Process: 1013 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
   Main PID: 1012 (rasdaemon)
      Tasks: 1 (limit: 154399)
     Memory: 15.3M
        CPU: 24ms
     CGroup: /system.slice/rasdaemon.service
             └─1012 /usr/sbin/rasdaemon -f -r

Apr 24 01:50:17 pve rasdaemon[1012]: Enabled event mce:mce_record
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: ras:extlog_mem_event event enabled
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Enabled event ras:extlog_mem_event
Apr 24 01:50:17 pve rasdaemon[1012]: ras:extlog_mem_event event enabled
Apr 24 01:50:17 pve rasdaemon[1012]: Enabled event ras:extlog_mem_event
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Listening to events for cpus 0 to 31
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording mc_event events
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording aer_event events
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording extlog_event events
Apr 24 01:50:17 pve rasdaemon[1012]: rasdaemon: Recording mce_record events

root@pve:~# systemctl status ras
rasdaemon.service   ras-mc-ctl.service  
root@pve:~# systemctl status ras-mc-ctl.service 
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
     Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2023-04-24 01:50:17 CEST; 33min ago
    Process: 1011 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
   Main PID: 1011 (code=exited, status=0/SUCCESS)
        CPU: 21ms

Apr 24 01:50:17 pve systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Apr 24 01:50:17 pve ras-mc-ctl[1011]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X570D4U-2L2T
Apr 24 01:50:17 pve systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware

Any ideas on how this could be debugged?

robinchrist avatar Apr 24 '23 00:04 robinchrist

I am having a similar problem with similar hardware/software: rasdaemon v0.6.6 (From Debian, deb http://ftp.us.debian.org/debian bullseye) on Kernel 6.2 (Proxmox 7.4, 6.2.11-1-pve); ASRock X570D4U-2L2T; and AMD Ryzen 5950X.

root@pve:~# dmesg -T | grep -i mce
[Wed May 10 07:29:55 2023] MCE: In-kernel MCE decoding enabled.
[Wed May 10 08:06:13 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 08:37:21 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:13:40 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 09:44:48 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:21:07 2023] mce: [Hardware Error]: Machine check events logged
[Wed May 10 10:52:15 2023] mce: [Hardware Error]: Machine check events logged
root@pve:~# ras-mc-ctl --summary
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No devlink errors.

No disk errors.

No MCE errors.
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.
root@pve:~# systemctl status rasdaemon.service 
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-05-10 07:29:57 EDT; 3h 36min ago
    Process: 2587 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
   Main PID: 2582 (rasdaemon)
      Tasks: 1 (limit: 154393)
     Memory: 15.2M
        CPU: 33ms
     CGroup: /system.slice/rasdaemon.service
             └─2582 /usr/sbin/rasdaemon -f -r

May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:arm_event
May 10 07:29:57 pve rasdaemon[2582]: mce:mce_record event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event mce:mce_record
May 10 07:29:57 pve rasdaemon[2582]: ras:extlog_mem_event event enabled
May 10 07:29:57 pve rasdaemon[2582]: Enabled event ras:extlog_mem_event
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mc_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording aer_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording extlog_event events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording mce_record events
May 10 07:29:57 pve rasdaemon[2582]: rasdaemon: Recording arm_event events
root@pve:~# systemctl status ras-mc-ctl
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
     Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
     Active: active (exited) since Wed 2023-05-10 07:29:57 EDT; 3h 37min ago
    Process: 2574 ExecStart=/usr/sbin/ras-mc-ctl --register-labels (code=exited, status=0/SUCCESS)
   Main PID: 2574 (code=exited, status=0/SUCCESS)
        CPU: 20ms

May 10 07:29:57 pve systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
May 10 07:29:57 pve ras-mc-ctl[2574]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X570D4U-2L2T
May 10 07:29:57 pve systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware.

I receive the errors only while a VM is running (generally TrueNAS SCALE), and they alternate between two intervals: the first error occurs after 31 minutes, 19 seconds; the second after 36 minutes, 8 seconds; the third after 31 minutes, 19 seconds; the fourth after 36 minutes, 8 seconds; and so on until I shut down the VM and/or the host.

EDIT: @robinchrist, your problem appears to occur approximately every 5 minutes, 11 seconds.
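
For anyone who wants to check the cadence on their own machine, here is a minimal sketch that prints the gap between consecutive "Machine check events logged" lines. It assumes the default dmesg format with seconds since boot in square brackets (not `dmesg -T`), and `mce_intervals` is just a hypothetical helper name.

```shell
#!/bin/sh
# Print the gap, in whole seconds, between consecutive
# "Machine check events logged" lines read from stdin.
# Assumes timestamps of the form "[  310.396113]" (seconds since boot).
mce_intervals() {
    grep 'Machine check events logged' |
    sed -n 's/^\[ *\([0-9][0-9.]*\)].*/\1/p' |
    awk 'NR > 1 { printf "%.0f\n", $1 - prev } { prev = $1 }'
}

# Usage (as root, or wherever dmesg is readable):
#   dmesg | mce_intervals
```

Fed the lines from the first post, this reports gaps of about 311 seconds, which matches the 5 minutes, 11 seconds mentioned above.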

PastramiKing avatar May 10 '23 15:05 PastramiKing

@PastramiKing do you have any memory OC running?

I did some experimental memory OC and the errors disappeared when I returned to stock, so I assume those were memory ECC errors.

robinchrist avatar May 11 '23 21:05 robinchrist

Same thing on Ubuntu Server 22.04.
Linux ecc 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Gigabyte B550M DS3H + Ryzen 3 PRO 3200G + Samsung 3200MHz 16GB ECC (undervolted and with tightened timings to provoke ECC errors).

Syslog has ECC errors:

May 17 16:27:25 ecc kernel: [ 316.509297] mce: [Hardware Error]: Machine check events logged
May 17 16:34:28 ecc kernel: [ 316.503731] mce: [Hardware Error]: Machine check events logged
May 17 16:42:36 ecc kernel: [ 316.502289] mce: [Hardware Error]: Machine check events logged
May 17 16:47:47 ecc kernel: [ 627.798510] mce: [Hardware Error]: Machine check events logged
May 17 16:52:58 ecc kernel: [ 939.094438] mce: [Hardware Error]: Machine check events logged
May 17 16:58:10 ecc kernel: [ 1250.390441] mce: [Hardware Error]: Machine check events logged
May 17 17:03:21 ecc kernel: [ 1561.686645] mce: [Hardware Error]: Machine check events logged
May 17 17:13:44 ecc kernel: [ 2184.278482] mce: [Hardware Error]: Machine check events logged

But rasdaemon (v0.6.7) says "Hey, all fine, no errors."

sudo ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No MCE errors.

sudo ras-mc-ctl --error-count
Label    CE  UE
DIMM_B2  0   0

Nuke79 avatar May 17 '23 17:05 Nuke79

I also get unrecorded MCE errors roughly every 5 minutes, running Proxmox 7.4-15.

~# systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2023-07-09 14:55:22 CEST; 20h ago
   Main PID: 3150285 (rasdaemon)
      Tasks: 1 (limit: 38336)
     Memory: 592.0K
        CPU: 7ms
     CGroup: /system.slice/rasdaemon.service
             └─3150285 /usr/sbin/rasdaemon -f -r

Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: rasdaemon: Enabled event ras:extlog_mem_event
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: Enabled event mce:mce_record
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: rasdaemon: Listening to events for cpus 0 to 11
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: ras:extlog_mem_event event enabled
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: Enabled event ras:extlog_mem_event
Jul 09 14:55:22 pverdrmain rasdaemon[3150285]: rasdaemon: Recording mc_event events
Jul 09 14:55:23 pverdrmain rasdaemon[3150285]: rasdaemon: Recording aer_event events
Jul 09 14:55:23 pverdrmain rasdaemon[3150285]: rasdaemon: Recording extlog_event events
Jul 09 14:55:24 pverdrmain rasdaemon[3150285]: rasdaemon: Recording mce_record events
Jul 09 14:55:24 pverdrmain rasdaemon[3150285]: rasdaemon: Recording arm_event events

~# systemctl status ras-mc-ctl
● ras-mc-ctl.service - Initialize EDAC v3.0.0 Drivers For Machine Hardware
     Loaded: loaded (/lib/systemd/system/ras-mc-ctl.service; enabled; vendor preset: enabled)
     Active: active (exited) since Sun 2023-07-09 14:55:23 CEST; 20h ago
   Main PID: 3150354 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 38336)
     Memory: 0B
        CPU: 0
     CGroup: /system.slice/ras-mc-ctl.service

Jul 09 14:55:23 pverdrmain systemd[1]: Starting Initialize EDAC v3.0.0 Drivers For Machine Hardware...
Jul 09 14:55:23 pverdrmain ras-mc-ctl[3150354]: ras-mc-ctl: Error: No dimm labels for ASRockRack model X470D4U2-2T
Jul 09 14:55:23 pverdrmain systemd[1]: Finished Initialize EDAC v3.0.0 Drivers For Machine Hardware.

[Mon Jul 10 11:19:13 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:24:24 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:29:36 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:34:47 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:39:58 2023] mce: [Hardware Error]: Machine check events logged
[Mon Jul 10 11:45:09 2023] mce: [Hardware Error]: Machine check events logged

~# ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

No devlink errors.

No disk errors.

No MCE errors.

Did anyone find out what is causing the errors and why rasdaemon 0.6.6 on debian is broken?

Is there a way to install rasdaemon 0.8 on debian 11? I tried but then I ran into package dependency hell and backed down.

githubDiversity avatar Jul 10 '23 09:07 githubDiversity

@robinchrist I also have this board with these errors.

voltagex avatar Aug 14 '23 14:08 voltagex

Yes, 5 minutes 11 seconds. I have the same on an ASRock Rack X470D4U2-2T with a Ryzen 5 2600 using ECC memory.

I also tried this on Debian 12 with the latest rasdaemon available there. Still the same, just as @robinchrist reported.

@mchehab Is there a more recent version I could try? Or are there debugging options I could enable that would help?

githubDiversity avatar Aug 14 '23 15:08 githubDiversity

I think I might be on to something regarding the 5 minutes (and in our case 11 seconds) interval.
https://www.kernel.org/doc/Documentation/x86/x86_64/machinecheck
--excerpt--
check_interval
    How often to poll for corrected machine check errors, in seconds (Note output is hexadecimal). Default 5 minutes.
--end excerpt--

I could be totally wrong though, just drawing attention to it so that more knowledgeable people can decide if it is relevant or not.
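
To compare the configured polling interval with the observed cadence, something like the following should work. It is a sketch under assumptions: the sysfs path `/sys/devices/system/machinecheck/machinecheck0/check_interval` is where I would expect the setting to live, and the value is converted as hexadecimal only because the excerpt says so. If your kernel prints plain decimal, the raw value is already the number of seconds. `hex_to_seconds` is a hypothetical helper name.

```shell
#!/bin/sh
# Interpret a check_interval reading as hexadecimal and print decimal seconds.
# The hexadecimal interpretation follows the quoted kernel documentation;
# it is an assumption that it applies to every kernel version.
hex_to_seconds() {
    printf '%d\n' "0x${1#0x}"
}

# Assumed sysfs location of the setting for CPU 0.
interval_file=/sys/devices/system/machinecheck/machinecheck0/check_interval
if [ -r "$interval_file" ]; then
    raw=$(cat "$interval_file")
    echo "raw value: $raw (read as hex: $(hex_to_seconds "$raw") s)"
fi
```

A default of 5 minutes is 300 seconds (hex 12c). That is close to, but not the same as, the roughly 311 seconds seen in the logs, and nothing in this thread explains the extra 11 seconds.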

githubDiversity avatar Aug 14 '23 16:08 githubDiversity

On Debian 11, or rather Proxmox 7.4 based on Debian 11, the edac_mce_amd module is not loaded by default, and that module seems to be needed to make MCE errors decipherable on AMD CPUs.

I have got that module loaded now, but rasdaemon is still not recording MCE errors.
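
For others who want to check the same thing, a small sketch: it looks for the module by name in `/proc/modules` and suggests `modprobe` if it is missing. `module_loaded` is a hypothetical helper, and whether loading the module changes what rasdaemon records is exactly what is unclear in this thread.

```shell
#!/bin/sh
# Succeed if module $1 is listed in a /proc/modules-style file
# ($2, defaulting to /proc/modules). The first column is the module name.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

if [ -r /proc/modules ]; then
    if module_loaded edac_mce_amd; then
        echo "edac_mce_amd is loaded"
    else
        echo "edac_mce_amd is not loaded; as root, try: modprobe edac_mce_amd"
    fi
fi
```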

githubDiversity avatar Aug 14 '23 17:08 githubDiversity

I wanted to report that I am also experiencing this constellation of symptoms: X570D4U-2L2T, Ryzen 9 5900X and 128GB of ECC RAM. Nothing is overclocked.

Proxmox: latest, running kernel 6.2.16-8-pve.

Rasdaemon is installed (and patched against the sqlite bug) but does not report any errors.

My errors are slightly more intermittent, but once they start they also have a 5-minute cadence:

[ 316.956399] mce: [Hardware Error]: Machine check events logged
[ 628.242421] mce: [Hardware Error]: Machine check events logged
[ 939.532553] mce: [Hardware Error]: Machine check events logged
[ 1250.822880] mce: [Hardware Error]: Machine check events logged
[ 1562.113834] mce: [Hardware Error]: Machine check events logged
[ 1873.404440] mce: [Hardware Error]: Machine check events logged
[ 2184.694951] mce: [Hardware Error]: Machine check events logged
[ 2495.986726] mce: [Hardware Error]: Machine check events logged
[ 2807.277286] mce: [Hardware Error]: Machine check events logged
[ 3118.567873] mce: [Hardware Error]: Machine check events logged
[ 3429.858463] mce: [Hardware Error]: Machine check events logged
[ 3741.149033] mce: [Hardware Error]: Machine check events logged
[ 4052.439293] mce: [Hardware Error]: Machine check events logged
[ 4363.729819] mce: [Hardware Error]: Machine check events logged
[ 4675.020201] mce: [Hardware Error]: Machine check events logged
[ 4986.310628] mce: [Hardware Error]: Machine check events logged
[ 5297.601098] mce: [Hardware Error]: Machine check events logged

Eventually the machine becomes unstable and needs to reboot, but usually this takes a few days.

DigiDr avatar Aug 21 '23 20:08 DigiDr

Can you elaborate on what you mean by unstable? I suggest running memtest86 for 24 hours to see if it reports errors.

voltagex avatar Aug 21 '23 22:08 voltagex

I do not believe checking memory is the next step.

I have a faulty memory module that I use to trigger memory-related ECC messages. On my setup those are reported in a separate category from the MCE errors.

I think this is CPU related or perhaps a systemic issue with ASRock Rack motherboards.

Anyway, I tried getting through to AMD for technical support, but that is rather difficult. The AMD community website was also unable to help with my specific inquiry about these MCE errors that occur every 5 minutes, 11 seconds and provide no details.

I am not sure how to proceed now. Does anyone have an email of AMD tech support?

githubDiversity avatar Aug 22 '23 00:08 githubDiversity

@voltagex I don’t believe these are memory issues. My ECC RAM would report these errors independently, and the timing makes no sense. But we need rasdaemon to expose it in any case. As for unstable: VMs seem to hang after a few days, and this is the only link I can make to that behaviour, which is the reason for investigating this.

DigiDr avatar Aug 22 '23 02:08 DigiDr

@githubDiversity Given the cluster of reports with the same ASRock Rack boards, ASRock Rack should be the first line of inquiry. But we need rasdaemon to expose these errors properly.

DigiDr avatar Aug 22 '23 02:08 DigiDr

On Debian 12 the latest version of rasdaemon is 0.6.8, but as reported here that is still buggy on Debian with Ryzen CPUs and/or ASRock Rack motherboards.

So I am trying to get rasdaemon 0.8.0.x installed on Debian 12 in an effort to shed some more light on these errors, but I am not experienced enough to pull it off.

Here I found the official source code for 0.8.0: http://www.infradead.org/~mchehab/rasdaemon/

But there seem to be compile and install instructions only for Fedora, and those slam into walls on Debian. The src.rpm, when converted to a .deb using the alien package, results in the following error when installing the rasdaemon 0.8.0 .deb:

apt install ./rasdaemon_0.8.0-2_amd64.deb
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'rasdaemon' instead of './rasdaemon_0.8.0-2_amd64.deb'
The following packages were automatically installed and are no longer required:
  g++-10 libdbd-sqlite3-perl libdbi-perl libjim0.79 libopts25 libstdc++-10-dev libtiff5 libwebp6 pve-kernel-5.13 pve-kernel-5.13.19-6-pve pve-kernel-5.15.108-1-pve python3-distro-info telnet unattended-upgrades
Use 'apt autoremove' to remove them.
The following packages will be upgraded:
  rasdaemon
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/403 kB of archives.
After this operation, 41.0 kB of additional disk space will be used.
Get:1 /root/rasdaemon/rasdaemon_0.8.0-2_amd64.deb rasdaemon amd64 0.8.0-2 [403 kB]
Reading changelogs... Done
(Reading database ... 99505 files and directories currently installed.)
Preparing to unpack .../rasdaemon_0.8.0-2_amd64.deb ...
Unpacking rasdaemon (0.8.0-2) over (0.6.8-1.1) ...
Setting up rasdaemon (0.8.0-2) ...
chown: invalid user: ‘mchehab:mchehab’
chown: invalid user: ‘mchehab:mchehab’
dpkg: error processing package rasdaemon (--configure):
 installed rasdaemon package post-installation script subprocess returned error exit status 1
Processing triggers for man-db (2.11.2-2) ...
Errors were encountered while processing:
 rasdaemon
N: Download is performed unsandboxed as root as file '/root/rasdaemon/rasdaemon_0.8.0-2_amd64.deb' couldn't be accessed by user '_apt'. - pkgAcquire::Run (13: Permission denied)
E: Sub-process /usr/bin/dpkg returned an error code (1)

Notice that the user mchehab:mchehab is not found.

Looks like hard-coded user names in the source?

Anyway, I am way out of my league here. Can anyone please point me in the right direction?

githubDiversity avatar Aug 25 '23 14:08 githubDiversity

Or I could give NixOS a try, installing it on a separate external USB drive. NixOS seems to be able to install any package at any version, including 0.8.

Would that be worth the trouble?

githubDiversity avatar Aug 25 '23 15:08 githubDiversity

@githubDiversity I think I've managed it on Debian (Proxmox):

# Remove the old database; the install needs to recreate it later with the right tables.
rm -r /var/lib/rasdaemon/ras-mc_event.db

apt-get install make gcc autoconf automake libtool libevent-dev tar libsqlite3-dev libdbd-sqlite3-perl  libtraceevent-dev libtraceevent pkg-config


cd ~/
wget https://www.infradead.org/~mchehab/rasdaemon/rasdaemon-0.8.0.tar.bz2
tar -xvf rasdaemon-0.8.0.tar.bz2
cd rasdaemon-0.8.0

autoreconf -vfi
./configure  --enable-all --localstatedir=/var
make
make install

This left me on the latest version with all feature flags enabled. I'll let you know what i discover about these errors.

--localstatedir=/var forces it to use the default location for debian.

compile time options summary
============================

    Sqlite3             : yes
    AER                 : yes
    MCE                 : yes
    EXTLOG              : yes
    CPER non-standard   : yes
    ABRT report         : yes
    HISI Kunpeng errors : yes
    ARM events          : yes
    DEVLINK             : yes
    Disk I/O errors     : yes
    Memory Failure      : yes
    Memory CE PFA       : yes
    AMP RAS errors      : yes
    CPU fault isolation : yes

Alternatively, you can compile with a different set of options (see ./configure --help) if you don't want the whole lot enabled.

DigiDr avatar Aug 25 '23 17:08 DigiDr

Alas, this still hasn't revealed the source of the MCE events:

No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

No disk errors.

No Memory failure errors.

No MCE errors.

DigiDr avatar Aug 25 '23 17:08 DigiDr

Thank you @DigiDr for showing how to install from source on Debian.

Such a bummer that it led nowhere.

I also have an ASRock Rack X470D4U board with an AMD Ryzen 5 2600 Pro on it. On that setup I get no 5 min 11 seconds MCE errors. Once I am back from holiday I will swap the Ryzen 5 2600 with the Ryzen 5 2600 Pro on this X470D4U2-2T board.

If the errors then disappear, I think I will have confirmed it is the CPU and not the board. Or might that not be the correct conclusion?

Anyway, here is a link to a thread with what I think is a related phenomenon: https://forum.level1techs.com/t/mce-corrected-errors/175366
It also mentions Red Hat telling customers with AMD CPUs to take additional steps, like loading the edac_mce_amd module I mentioned a few posts earlier.

You might try loading the edac_mce_amd module to see if that enables rasdaemon 0.8 to make sense of the errors. I tried with rasdaemon 0.6.6 earlier but that did not change anything. I am not even sure whether the edac_mce_amd module is relevant.

One final thing I will try is installing Windows Server 2022, in the hope that it will uncover what is going on every 5 minutes and 11 seconds.

After that I am at my wits' end and would hope @mchehab could pitch in.

githubDiversity avatar Aug 26 '23 02:08 githubDiversity

Dear all, I think the problem is going to be related to some settings in the BIOS/UEFI. AMD firmware has a lot of barely documented options related to MCE handling. For example, on some desktop AM4 boards, with some firmware versions, one needs to set Platform First Error Handling to disabled, otherwise ECC errors will not show up in the kernel logs. But beyond that there are a lot more options with no real documentation, like MCA error thresholding, and many more.

Therefore I think the solution to your problems could be to change one or more of the obscure RAS/MCA related firmware configuration options.

TiborGY avatar Aug 26 '23 16:08 TiborGY

otherwise ECC errors will not show up in the kernel logs. But beyond that there are a lot more options with no real documentation, like MCA error thresholding, and many more.

but if it shows up in the kernel logs (which it does, because otherwise we wouldn't know that rasdaemon doesn't report them), shouldn't rasdaemon be able to make sense of it?

Or can it be that the machine just reports "there was some error" to the OS but no additional information about what exactly etc?

Maybe some expert can jump in and help

robinchrist avatar Aug 26 '23 16:08 robinchrist

Or can it be that the machine just reports "there was some error" to the OS but no additional information about what exactly etc?

This is exactly what I am suspecting.

TiborGY avatar Aug 26 '23 16:08 TiborGY

I am fighting to get Windows installed. Damn, that is hard on bare metal these days ;( Anyway, can you guys please check the voltage level of the onboard battery? You can find it on the overview page of the IPMI interface.

Mine is at 0.0V and I also noticed battery low errors. Not sure if it is related, though.

githubDiversity avatar Aug 31 '23 15:08 githubDiversity

OK, I am getting closer (giving up on installing Windows for the time being; it's too difficult, grrr).

So it turns out that tracing was not enabled by default on Debian 11 (Proxmox 7.3). Upgrading to Debian 12 (Proxmox 8.4) does not change anything.

cat /sys/kernel/debug/tracing/events/mce/mce_record/enable
0

It should be 1.
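
The check above can be scripted roughly as follows. This is a sketch: `tracepoint_enabled` is a hypothetical helper, and it only inspects the global tracing directory named in this post. As far as I understand it, rasdaemon enables events inside its own tracing instance (under `instances/rasdaemon` in the tracing directory), so a 0 in the global file may not by itself mean rasdaemon is not listening. Treat that as an assumption to verify on your system.

```shell
#!/bin/sh
# Succeed if the given tracepoint "enable" file contains exactly 1.
tracepoint_enabled() {
    [ "$(cat "$1" 2>/dev/null)" = "1" ]
}

enable_file=/sys/kernel/debug/tracing/events/mce/mce_record/enable
if tracepoint_enabled "$enable_file"; then
    echo "mce_record tracepoint is enabled"
else
    echo "mce_record tracepoint is disabled or unreadable"
    # To enable it by hand (as root):
    #   echo 1 > /sys/kernel/debug/tracing/events/mce/mce_record/enable
fi
```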

After I enabled it, I get this in the trace:

cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 4/4   #P:12
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
 kworker/0:2-89      [000] .....  6853.046146: mce_record: CPU: 0, MCGc/s: 117/0, MC15: dc2040000000011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000003c0b42f00/d01b0fff01000000/000002630a400a02, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:800f82, TIME: 1693507232, SOCKET: 0, APIC: 0
 kworker/0:2-89      [000] .....  7164.338881: mce_record: CPU: 0, MCGc/s: 117/0, MC15: dc2040000000011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000003c0b42f00/d01b0fff01000000/000002630a400a02, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:800f82, TIME: 1693507544, SOCKET: 0, APIC: 0
 kworker/0:2-89      [000] .....  7475.635616: mce_record: CPU: 0, MCGc/s: 117/0, MC15: dc2040000000011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000003c0b42f00/d01b0fff01000000/000002630a400a02, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:800f82, TIME: 1693507855, SOCKET: 0, APIC: 0
 kworker/0:2-89      [000] .....  7786.924353: mce_record: CPU: 0, MCGc/s: 117/0, MC15: dc2040000000011b, IPID: 0000009600050f00, ADDR/MISC/SYND: 00000003c0b42f00/d01b0fff01000000/000002630a400a02, RIP: 00:<0000000000000000>, TSC: 0, PROCESSOR: 2:800f82, TIME: 1693508166, SOCKET: 0, APIC: 0

systemctl status rasdaemon
● rasdaemon.service - RAS daemon to log the RAS events
     Loaded: loaded (/lib/systemd/system/rasdaemon.service; enabled; preset: enabled)
     Active: active (running) since Thu 2023-08-31 20:48:20 CEST; 8min ago
   Main PID: 228553 (rasdaemon)
      Tasks: 26 (limit: 38328)
     Memory: 6.4M
        CPU: 144ms
     CGroup: /system.slice/rasdaemon.service
             └─228553 /usr/sbin/rasdaemon -f -r

Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording extlog_event events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording extlog_event events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording mce_record events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: read
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording mce_record events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: read
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording mce_record events
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: read
Aug 31 20:48:20 rasdaemon[228553]: rasdaemon: Recording mce_record events
Aug 31 20:50:55 rasdaemon[228553]: rasdaemon: mce_record store: 0x7fe03c022c88

But still rasdaemon is not recording those errors, let alone decoding them into human readable format.
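
Until rasdaemon decodes these, the status word from the trace can be picked apart by hand. The sketch below is not rasdaemon's decoder. It assumes the usual x86 MCA status layout (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57) plus AMD's CECC and UECC flags in bits 46 and 45, which should be checked against the processor documentation. `decode_mca_status` is a hypothetical helper and expects exactly 16 hex digits, as printed in the trace.

```shell
#!/bin/sh
# Print the names of the flag bits set in a 64-bit MCA status word.
# Input: 16 hex digits without a 0x prefix, e.g. dc2040000000011b.
# Bit positions are assumptions (generic x86 MCA layout plus AMD CECC/UECC).
decode_mca_status() {
    status=$1
    hi_hex=${status%????????}    # first 8 digits = bits 63..32
    hi=$((0x$hi_hex))
    flags=""
    for pair in VAL:63 OVER:62 UC:61 EN:60 MISCV:59 ADDRV:58 PCC:57 CECC:46 UECC:45; do
        name=${pair%%:*}
        bit=${pair##*:}
        # All listed bits live in the upper word, so shift by (bit - 32).
        if [ $(( (hi >> (bit - 32)) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
    done
    echo "${flags# }"
}

decode_mca_status dc2040000000011b
```

For the value in the trace this prints `VAL OVER EN MISCV ADDRV CECC`. If the bit assignments assumed here are right, that would point to a corrected ECC error (CECC set, UC clear) with the overflow flag set, but the thread has not confirmed this reading.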

Isn't it a serious issue that, if tracing is not enabled by default, a lot of people might feel covered by rasdaemon while they are not? I wonder if the Proxmox team is aware of this. Or perhaps this is something that is better configured if one uses the enterprise subscription.

githubDiversity avatar Aug 31 '23 19:08 githubDiversity

Thanks to @DigiDr's compile instructions I am now running rasdaemon 0.8.

Tracing is enabled and the trace is being populated.

But still rasdaemon is not playing ball:

/usr/local/sbin/ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

No disk errors.

No Memory failure errors.

No MCE errors.

I think now is the time to escalate to @mchehab. rasdaemon seems seriously broken for several versions already.

githubDiversity avatar Aug 31 '23 19:08 githubDiversity

rasdaemon silently crashes with a segmentation fault after a few hours running in the foreground.

` rasdaemon -r -f
rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
rasdaemon: Improper PAGE_CE_THRESHOLD, set to default 50.
rasdaemon: Improper PAGE_CE_REFRESH_CYCLE, set to default 24h.
rasdaemon: Threshold of memory Corrected Errors is 50 / 24h
rasdaemon: ras:mc_event event enabled
rasdaemon: Enabled event ras:mc_event
rasdaemon: ras:aer_event event enabled
rasdaemon: Enabled event ras:aer_event
rasdaemon: ras:non_standard_event event enabled
rasdaemon: Enabled event ras:non_standard_event
rasdaemon: ras:arm_event event enabled
rasdaemon: Enabled event ras:arm_event
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu0/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu12/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu13/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu14/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu15/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu16/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu17/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu18/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu19/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu20/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu21/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu22/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu23/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu24/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu25/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu26/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu27/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu28/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu29/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu30/online failed
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu31/online failed
rasdaemon: Cpu fault isolation is disabled
rasdaemon: mce:mce_record event enabled
rasdaemon: Enabled event mce:mce_record
rasdaemon: ras:extlog_mem_event event enabled
rasdaemon: Enabled event ras:extlog_mem_event
rasdaemon: net:net_dev_xmit_timeout event enabled
rasdaemon: Enabled event net:net_dev_xmit_timeout
rasdaemon: devlink:devlink_health_report event enabled
rasdaemon: Enabled event devlink:devlink_health_report
rasdaemon: block:block_rq_error event enabled
rasdaemon: Enabled event block:block_rq_error
rasdaemon: ras:memory_failure_event event enabled
rasdaemon: Enabled event ras:memory_failure_event
rasdaemon: Listening to events for cpus 0 to 31
Calling ras_mc_event_opendb()
rasdaemon: Recording mc_event events
rasdaemon: Recording aer_event events
rasdaemon: Recording extlog_event events
rasdaemon: Recording mce_record events
rasdaemon: Recording non_standard_event events
rasdaemon: Recording arm_event events
rasdaemon: Recording devlink_event events
rasdaemon: Recording disk_errors events
rasdaemon: Recording memory_failure_event events
rasdaemon: Error on CPU 12
rasdaemon: Error on CPU 13
rasdaemon: Error on CPU 14
rasdaemon: Error on CPU 15
rasdaemon: Error on CPU 16
rasdaemon: Error on CPU 17
rasdaemon: Error on CPU 18
rasdaemon: Error on CPU 19
rasdaemon: Error on CPU 20
rasdaemon: Error on CPU 21
rasdaemon: Error on CPU 22
rasdaemon: Error on CPU 23
rasdaemon: Error on CPU 24
rasdaemon: Error on CPU 25
rasdaemon: Error on CPU 26
rasdaemon: Error on CPU 27
rasdaemon: Error on CPU 28
rasdaemon: Error on CPU 29
rasdaemon: Error on CPU 30
rasdaemon: Error on CPU 31
rasdaemon: Old kernel detected. Stop listening and fall back to pthread way.
Calling ras_mc_event_closedb()
rasdaemon: Listening to events on cpu 0
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 1
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 2
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 3
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 4
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 5
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 6
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 7
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 8
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 9
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 10
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 11
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 12
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 14
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 13
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 15
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 16
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 17
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 18
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 19
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 20
rasdaemon: Listening to events on cpu 23
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 21
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 22
Calling ras_mc_event_opendb()
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 24
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 25
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 26
Calling ras_mc_event_opendb()
rasdaemon: Listening to events on cpu 28 Calling ras_mc_event_opendb() rasdaemon: Listening to events on cpu 27 Calling ras_mc_event_opendb() rasdaemon: Listening to events on cpu 29 Calling ras_mc_event_opendb() rasdaemon: Listening to events on cpu 30 Calling ras_mc_event_opendb() rasdaemon: Listening to events on cpu 31 Calling ras_mc_event_opendb() rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording aer_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording aer_event events rasdaemon: Recording extlog_event events rasdaemon: Recording mc_event events rasdaemon: Recording aer_event events rasdaemon: Recording mc_event events rasdaemon: Recording aer_event events rasdaemon: Recording mc_event events rasdaemon: Recording aer_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording aer_event events rasdaemon: Recording aer_event events rasdaemon: Recording aer_event events rasdaemon: Recording aer_event events rasdaemon: Recording aer_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording aer_event events rasdaemon: Recording aer_event events rasdaemon: Recording aer_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording aer_event events rasdaemon: Recording aer_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording aer_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event 
events rasdaemon: Recording mce_record events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording mce_record events rasdaemon: Recording mce_record events rasdaemon: Recording mce_record events rasdaemon: Recording mce_record events rasdaemon: Recording arm_event events rasdaemon: Recording mce_record events rasdaemon: rasdaemon: Recording mce_record events Recording mc_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording mce_record events rasdaemon: Recording mce_record events rasdaemon: Recording mce_record events rasdaemon: Recording mce_record events rasdaemon: Recording mce_record events rasdaemon: Recording devlink_event events rasdaemon: Recording mce_record events rasdaemon: Recording non_standard_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording arm_event events rasdaemon: Recording mce_record events rasdaemon: Recording non_standard_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording mce_record events rasdaemon: Recording non_standard_event events rasdaemon: Recording disk_errors events rasdaemon: Recording mce_record events rasdaemon: Recording arm_event events rasdaemon: Recording devlink_event events rasdaemon: Recording arm_event events rasdaemon: Recording arm_event events rasdaemon: rasdaemon: Recording arm_event events rasdaemon: Recording non_standard_event events rasdaemon: rasdaemon: Recording arm_event events rasdaemon: Recording arm_event events rasdaemon: 
rasdaemon: Recording arm_event events Recording arm_event events Recording non_standard_event events Recording mc_event events rasdaemon: Recording aer_event events rasdaemon: Recording arm_event events rasdaemon: Recording arm_event events rasdaemon: Recording devlink_event events rasdaemon: Recording memory_failure_event events rasdaemon: Recording arm_event events rasdaemon: Recording devlink_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording devlink_event events rasdaemon: Recording devlink_event events rasdaemon: Recording devlink_event events rasdaemon: Recording devlink_event events rasdaemon: rasdaemon: rasdaemon: Recording disk_errors events Recording devlink_event events Recording arm_event events rasdaemon: Recording devlink_event events rasdaemon: Recording devlink_event events rasdaemon: Recording devlink_event events rasdaemon: Recording arm_event events rasdaemon: Recording arm_event events rasdaemon: Recording disk_errors events rasdaemon: Recording disk_errors events rasdaemon: Recording extlog_event events rasdaemon: Recording devlink_event events rasdaemon: Recording mc_event events rasdaemon: rasdaemon: Recording disk_errors events Recording memory_failure_event events rasdaemon: Recording disk_errors events rasdaemon: Recording disk_errors events rasdaemon: Recording disk_errors events rasdaemon: Recording aer_event events rasdaemon: Recording memory_failure_event events rasdaemon: Recording devlink_event events rasdaemon: Recording disk_errors events rasdaemon: Recording disk_errors events rasdaemon: Recording devlink_event events rasdaemon: Recording memory_failure_event events rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: Recording devlink_event events rasdaemon: Recording memory_failure_event events rasdaemon: Recording memory_failure_event events rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: Recording disk_errors events rasdaemon: Recording memory_failure_event events rasdaemon: 
Recording mc_event events rasdaemon: Recording disk_errors events rasdaemon: Recording memory_failure_event events rasdaemon: Recording memory_failure_event events rasdaemon: Recording aer_event events rasdaemon: Recording disk_errors events rasdaemon: Recording disk_errors events rasdaemon: Recording memory_failure_event events rasdaemon: Recording disk_errors events rasdaemon: Recording memory_failure_event events rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: Recording memory_failure_event events rasdaemon: read rasdaemon: Recording mc_event events Calling ras_mc_event_closedb() rasdaemon: Recording aer_event events rasdaemon: Recording disk_errors events rasdaemon: Recording memory_failure_event events rasdaemon: Recording memory_failure_event events rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: Recording extlog_event events rasdaemon: Recording memory_failure_event events rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: Recording memory_failure_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording extlog_event events rasdaemon: Recording extlog_event events rasdaemon: Recording mce_record events rasdaemon: Recording aer_event events rasdaemon: Recording mc_event events rasdaemon: Recording mc_event events rasdaemon: Recording mce_record events rasdaemon: Recording mc_event events rasdaemon: Recording aer_event events rasdaemon: Recording aer_event events rasdaemon: Recording mc_event events rasdaemon: Recording extlog_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording aer_event events rasdaemon: rasdaemon: Recording mc_event events Recording extlog_event events rasdaemon: Recording aer_event events rasdaemon: rasdaemon: Recording extlog_event events rasdaemon: Recording mce_record events Recording arm_event events rasdaemon: Recording extlog_event events rasdaemon: Recording non_standard_event events 
rasdaemon: Recording devlink_event events rasdaemon: Recording mce_record events rasdaemon: Recording arm_event events rasdaemon: Recording disk_errors events rasdaemon: Recording non_standard_event events rasdaemon: rasdaemon: Recording memory_failure_event events rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: Recording aer_event events rasdaemon: Recording mce_record events rasdaemon: Recording extlog_event events Recording mce_record events rasdaemon: Recording devlink_event events rasdaemon: Recording arm_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording disk_errors events rasdaemon: rasdaemon: Recording aer_event events Recording devlink_event events rasdaemon: Recording memory_failure_event events rasdaemon: read rasdaemon: rasdaemon: rasdaemon: Recording disk_errors events Recording arm_event events rasdaemon: Recording extlog_event events Recording mce_record events rasdaemon: Recording arm_event events Calling ras_mc_event_closedb() rasdaemon: Recording memory_failure_event events rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: Recording devlink_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording devlink_event events rasdaemon: Recording mce_record events rasdaemon: Recording disk_errors events rasdaemon: Recording arm_event events rasdaemon: Recording non_standard_event events rasdaemon: Recording memory_failure_event events rasdaemon: Recording devlink_event events rasdaemon: Recording arm_event events rasdaemon: Recording disk_errors events rasdaemon: Recording devlink_event events rasdaemon: Recording memory_failure_event events rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: Recording disk_errors events rasdaemon: Recording disk_errors events rasdaemon: Recording memory_failure_event events rasdaemon: read Calling ras_mc_event_closedb() rasdaemon: Recording memory_failure_event events rasdaemon: read Calling 
ras_mc_event_closedb() rasdaemon: mce_record store: 0x7f3794022ca8

Segmentation fault

`

But the rasdaemon SQLite database remains unchanged:

/var/lib/rasdaemon# ls -l
total 5
-rw-r--r-- 1 root root 40960 Aug 31 21:37 ras-mc_event.db

/usr/local/sbin/ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.

/usr/local/sbin/ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No ARM processor errors.

No Extlog errors.

No devlink errors.

No disk errors.

No Memory failure errors.

No MCE errors.
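
To double-check the database independently of `ras-mc-ctl`, one could count the rows in every table of the SQLite file directly. This is a rough sketch, assuming the database lives at `/var/lib/rasdaemon/ras-mc_event.db` as in the listing above. It makes no assumption about table names and simply enumerates whatever `sqlite_master` reports. The helper name `table_row_counts` is made up for this example.

```python
import os
import sqlite3


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in the SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        counts = {}
        for name in tables:
            # Names come from sqlite_master itself; quote them defensively anyway.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()


if __name__ == "__main__":
    path = "/var/lib/rasdaemon/ras-mc_event.db"  # path taken from the listing above
    if os.path.exists(path):  # avoid creating an empty file by accident
        for table, rows in table_row_counts(path).items():
            print(f"{table}: {rows}")
```

If every table reports zero rows while the kernel log keeps showing machine check events, the events are being lost before they reach the database rather than in the reporting tool.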

From the output of running rasdaemon in the foreground, I noticed that rasdaemon tried to open files for, and listen on, CPUs that do not exist.

I have the Ryzen 5 2600 with 6 cores / 12 threads. Not sure why it tried to do things with CPUs up to 31.

` ls /sys/devices/system/cpu/
cpu0 cpu10 cpu2 cpu4 cpu6 cpu8 cpufreq hotplug kernel_max modalias online power smt vulnerabilities
cpu1 cpu11 cpu3 cpu5 cpu7 cpu9 cpuidle isolated microcode offline possible present uevent

`
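
One way to check where the 32 might come from is to compare the `possible`, `present` and `online` files that appear in the listing above. This is a sketch, assuming those files use the kernel's usual cpulist format (for example `0-11` or `0-3,8`). Whether rasdaemon actually derives its CPU count from the `possible` mask is an assumption and has not been confirmed in this thread. The function names are illustrative.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel cpulist string such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file, e.g. "offline" when nothing is offline
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def read_cpu_masks(base="/sys/devices/system/cpu"):
    """Read the possible/present/online files; None marks a file that is missing."""
    masks = {}
    for name in ("possible", "present", "online"):
        try:
            masks[name] = parse_cpu_list((Path(base) / name).read_text())
        except OSError:
            masks[name] = None
    return masks


if __name__ == "__main__":
    for name, cpus in read_cpu_masks().items():
        print(name, "unknown" if cpus is None else f"{len(cpus)} CPUs")
```

If `possible` reports 32 CPUs while `present` and `online` report 12, that would at least be consistent with the failed attempts on cpu12 to cpu31 in the log.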

githubDiversity avatar Sep 01 '23 02:09 githubDiversity

I no longer think this is the best place to discuss our issues because, although rasdaemon is not working as expected, the underlying problem is no longer related to it.

How I came to that conclusion is described below.

If you want to follow my progress as I pin down the exact cause, please DM me and I will create a thread on the ASRock Rack support forums.


I tried running Fedora 38 KDE Plasma from a live USB for a while.

Installed rasdaemon 0.8.1.

Same 5 minutes 11 seconds thing, and rasdaemon still seems broken: nothing was recorded even after I enabled tracing. I think rasdaemon should check on startup whether tracing is enabled.
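
The 5 minutes 11 seconds interval can be read straight off the kernel timestamps. As a small illustration, the sketch below uses four lines copied from the dmesg output in the opening post of this thread (a different machine, but the same pattern) and computes the gaps between them. The function name `mce_intervals` is made up for this example.

```python
import re

# Timestamps copied from the dmesg output in the opening post of this thread.
DMESG = """\
[  316.656894] mce: [Hardware Error]: Machine check events logged
[  627.947258] mce: [Hardware Error]: Machine check events logged
[  939.240972] mce: [Hardware Error]: Machine check events logged
[ 1250.534814] mce: [Hardware Error]: Machine check events logged
"""


def mce_intervals(dmesg_text):
    """Return gaps in seconds between consecutive 'Machine check events logged' lines."""
    stamps = [float(m.group(1)) for m in re.finditer(
        r"^\[\s*(\d+\.\d+)\] mce: \[Hardware Error\]: Machine check events logged",
        dmesg_text, re.MULTILINE)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for gap in mce_intervals(DMESG):
        minutes, seconds = divmod(round(gap), 60)
        print(f"{gap:.1f} s = {minutes} min {seconds} s")
```

Each gap rounds to 311 seconds, which is where the 5 min 11 s figure comes from.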

Same tracing info as on Debian 12.

I replaced the CPU with a Ryzen 5 2600 Pro but still get the same MCE errors every 5 min 11 seconds. I did manage to destroy my ability to remote view via IPMI though. Yeee, no good deed goes unpunished ;(

So now I will start removing PCI-connected devices and see if that changes things, but I will not share that progress here as it is not related to rasdaemon.

githubDiversity avatar Sep 01 '23 07:09 githubDiversity

I am not well versed in this site. Can I send/receive private messages here?

Anyway, I have gone and opened a support ticket at ASRock Rack. Since these are server products, the ASRock forum is not the place to ask questions.

githubDiversity avatar Sep 01 '23 11:09 githubDiversity

As usual, ASRock Rack tech support is keen to help out and we are now in the process of digging down.

But I really think that if a Linux system uses rasdaemon, it should work out of the box or at least tell the admin why it will probably not work. @mchehab, I really think we need your involvement right now. This is no longer just an isolated case.

githubDiversity avatar Sep 05 '23 09:09 githubDiversity

That's interesting - what response did you get from tech support?

voltagex avatar Sep 05 '23 12:09 voltagex

As I stated earlier, I do not believe this is the correct place to discuss mobo-related issues.

All I hope for is that the issue of rasdaemon seemingly having been broken for a long time gets the attention it needs, as many people run it.

githubDiversity avatar Sep 05 '23 12:09 githubDiversity