rasdaemon icon indicating copy to clipboard operation
rasdaemon copied to clipboard

rasdaemon: ipmitool SEL logging of AER CEs on OpenBMC platforms

Open krishnanadh opened this issue 2 years ago • 8 comments

Log to OpenBMC SEL logs, all the AER correctable errors that are handled by the kernel. The IPMI command record fields used are defined in the IPMI specification v2.0. Non-timestamped record type is used here given that the BMC will attach one at the time of logging.

Test Plan: Tested on OpenBMC platforms using EINJ.

Signed-off-by: Krishna Dhulipala [email protected] Reviewed-by: Ril Van Riel [email protected]

krishnanadh avatar May 03 '23 02:05 krishnanadh

@mchehab, Can you please help review the PR?

krishnanadh avatar May 08 '23 18:05 krishnanadh

Just a question: How do we identify if a platform is an OpenBMC one where ipmitool should/could be used to report the AER information ?

RocheWilliam avatar Jul 13 '23 11:07 RocheWilliam

There's no straightforward way to identify what BMC type is running underneath. Therefore added the configure option.

krishnanadh avatar Jul 13 '23 17:07 krishnanadh

If we can't automatically identify what machines should use the ipmitool calls implemented here, shouldn't we have a runtime option to enable or disable this ipmitool use in addition to the configure option (generating a different binary) ? So that an identical binary can be used on OpenBMC platforms and non OpenBMC ?

RocheWilliam avatar Jul 14 '23 06:07 RocheWilliam

Looked into adding a runtime option through a command line argument. The ipmitool call is running in the even handler and since we are using libtraceevent to register the ras_aser_event_handler, passing down a runtime argument will not be as clean. Any suggestions on how to go about it?

krishnanadh avatar Jul 17 '23 20:07 krishnanadh

I would suggest to enrich the handler_ras_events() function passing more arguments information than just the "record_events" so that we could have an initialize function like ras_aer_handler_init(int force_ipmitool) that would update a static flag in the ras-aer-handler.c for example. (I also had to create a mechanism for some Ampere machines to use ipmitool in case of AER errors -- you can give a look at this code which is also the pull request #56 so that a single arm binary can be used on these platforms or any other arm platforms without modification)

RocheWilliam avatar Jul 18 '23 12:07 RocheWilliam

Thank you for the suggestions. Added a runtime argument -i and patched this PR.

krishnanadh avatar Jul 19 '23 04:07 krishnanadh

@mchehab, Can you please help review the PR?

Sorry, I've been busy those days. There are some issues on this PR:

  • The changes are not following the Linux Kernel coding style. We use it within rasdaemon. In particular, indents are done with tabs, and they mean 8 spaces;
  • Patches need to have your certificate of origin (Signed-off-by): See https://wiki.linuxfoundation.org/dco;
  • There are 3 patches with the same name/description but different contents. You should likely merge them, together with a typo fix;
  • with regards to this commit: d7d4c57efbd2 ("Only log correctable errors for unified SEL") why are you dropping uncorrectable logs? Shouldn't it be an option to enable/disable it at runtime? With such patch, rasdaemon generates a build warning:
    unified-sel.c:33:20: warning: ‘uncor_error_ids’ defined but not used [-Wunused-variable]
     33 | static const char *uncor_error_ids[32] = {
        |                    ^~~~~~~~~~~~~~~
    

mchehab avatar Oct 23 '23 08:10 mchehab