feat: add ipmi_sel_events_time

Open HappyFX opened this issue 1 year ago • 3 comments

Add a type label to ipmi_sel_events_count_by_state so that SEL events can be distinguished by type:

ipmi_sel_events_count_by_state{type="Power Supply",state="Critical"} 18

Add a new metric, ipmi_sel_events_time, that records the time of the most recent occurrence of each event in the log, so that it is retained in Prometheus even after the SEL has been cleared. It is not a 1:1 copy of the SEL, so cardinality stays low: in this example, 18 real events are represented by 4 series, each carrying the time of the most recent occurrence:

ipmi_sel_events_time{event="Power Supply Failure detected",name="Power Supply 2 Status",state="Critical",type="Power Supply"} 1.731064451e+09
ipmi_sel_events_time{event="Power Supply Failure detected ; Fan Fault",name="Power Supply 2 Status",state="Critical",type="Power Supply"} 1.731064449e+09
ipmi_sel_events_time{event="Power Supply input lost (AC/DC)",name="Power Supply 2 Status",state="Critical",type="Power Supply"} 1.727789819e+09
ipmi_sel_events_time{event="Redundancy Lost",name="System Board PS Redundancy",state="Critical",type="Power Supply"} 1.731064452e+09

Original SEL log for the example above:

4   | Sep-10-2024 | 16:14:38 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected ; Fan Fault
5   | Sep-10-2024 | 16:14:38 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected
7   | Sep-10-2024 | 16:14:42 | System Board PS Redundancy       | Power Supply                | Critical | Redundancy Lost
8   | Oct-01-2024 | 09:26:41 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply input lost (AC/DC)
9   | Oct-01-2024 | 09:26:50 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected ; Fan Fault
10  | Oct-01-2024 | 09:26:50 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected
14  | Oct-01-2024 | 11:51:45 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected ; Fan Fault
15  | Oct-01-2024 | 11:51:47 | System Board PS Redundancy       | Power Supply                | Critical | Redundancy Lost
16  | Oct-01-2024 | 11:51:47 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected
18  | Oct-01-2024 | 13:36:59 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply input lost (AC/DC)
21  | Oct-01-2024 | 13:37:58 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected ; Fan Fault
23  | Oct-01-2024 | 13:37:59 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected
26  | Oct-07-2024 | 19:24:08 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected ; Fan Fault
28  | Oct-07-2024 | 19:24:11 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected
29  | Oct-07-2024 | 19:24:12 | System Board PS Redundancy       | Power Supply                | Critical | Redundancy Lost
31  | Nov-08-2024 | 11:14:09 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected ; Fan Fault
32  | Nov-08-2024 | 11:14:11 | Power Supply 2 Status            | Power Supply                | Critical | Power Supply Failure detected
33  | Nov-08-2024 | 11:14:12 | System Board PS Redundancy       | Power Supply                | Critical | Redundancy Lost

HappyFX avatar Dec 06 '24 19:12 HappyFX

Hello, can you check the PR? @bitfehler @SuperQ

HappyFX avatar Dec 10 '24 09:12 HappyFX

@RichiH maybe you can check this PR?

HappyFX avatar Dec 19 '24 11:12 HappyFX

Now, I'll be honest: I am not sure what it is you are ultimately trying to achieve, but you should really consider ingesting your SEL into a logging solution of your choice, because it looks like you are trying (again) to use Prometheus as a log monitoring platform, which it isn't.

Yes, cardinality is somewhat reduced from your original proposal, but that doesn't make this right. I also don't understand why the custom event mechanism, which is extremely flexible, does not work for you?

In general, I don't think the SEL metrics should be expanded any further, because metrics are not about logs. I would at most be willing to make two concessions: I am ok with including the type in the metrics (though we may have to rename the metric then), and if you can explain to me why the custom events stuff doesn't work for you (other than "I want all my logs in Prometheus"), we can maybe adapt it to suit your needs, but this would have to be marginal stuff.

bitfehler avatar Feb 03 '25 11:02 bitfehler