node-exporter-textfile-collector-scripts
storcli.py shows battery_backup_healthy when it needs attention
I have some megaraid controllers which are returning the following:
megaraid_healthy 0 <== there's a problem
megaraid_failed 0
megaraid_degraded 0
megaraid_battery_backup_healthy 1
This is odd: the controller says it needs attention, but it's not obvious why.
On closer inspection: storcli.py returns battery_backup_healthy 1 if the "BBU Status" value is 0 or 32. I'm getting 32, and the battery is also "Degraded":
# /opt/MegaRAID/storcli/storcli64 /cALL show all J | less
...
"Status" : {
==> "Controller Status" : "Needs Attention",
"Memory Correctable Errors" : 0,
"Memory Uncorrectable Errors" : 0,
"ECC Bucket Count" : 0,
"Any Offline VD Cache Preserved" : "No",
==> "BBU Status" : 32,
"PD Firmware Download in progress" : "No",
"Support PD Firmware Download" : "No",
"Lock Key Assigned" : "No",
"Failed to get lock key on bootup" : "No",
"Lock key has not been backed up" : "No",
"Bios was not detected during boot" : "No",
"Controller must be rebooted to complete security operation" : "No",
"A rollback operation is in progress" : "No",
"At least one PFK exists in NVRAM" : "No",
"SSC Policy is WB" : "No",
"Controller has booted into safe mode" : "No",
"Controller shutdown required" : "No"
},
...
"BBU_Info" : [
{
"Model" : "iBBU",
==> "State" : "Dgd (Needs Attention)",
"RetentionTime" : "48 hours +",
"Temp" : "29C",
"Mode" : "-",
"MfgDate" : "2014/02/10",
"Next Learn" : "2019/06/27 01:33:42"
}
]
My best guess is that the controller "Needs Attention" because of the battery status, but I can't find documentation for what status=32 means. Can you point to some info which says that 32 is healthy?
For comparison, here's what MegaCLI says on the same controller:
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: iBBU
Voltage: 4014 mV
Current: 0 mA
Temperature: 29 C
Battery State: Degraded(Need Attention)
A manual learn is required.
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
GasGuageStatus:
Fully Discharged : No
Fully Charged : No
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 75 %
Charger System State: 49169
Charger System Ctrl: 0
Charging current: 512 mA
Absolute state of charge: 77 %
Max Error: 9 %
Exit Code: 0x00
Perhaps 32 means "manual learn is required"? But in that case, I'd say it's not "healthy", in the sense that some attention is required.
On another controller, which is healthy, the BBU state is 0. This one has CacheVault_Info rather than BBU_Info:
"Cachevault_Info" : [
{
"Model" : "CVPM02",
"State" : "Optimal",
"Temp" : "30C",
"Mode" : "-",
"MfgDate" : "2014/05/30"
}
]
(Aside 1: storcli.py provides a metric megaraid_cv_temperature for the temperature from Cachevault_Info, but not the temperature from BBU_Info)
On a different controller, which doesn't have a BBU at all, I get megaraid_battery_backup_healthy 0. In other words: it's flagging the battery as "bad" even though the controller is healthy and there's no action required. The JSON contains:
"BBU Status" : "NA",
(Aside 2: I would be inclined in this state to drop the megaraid_battery_backup_healthy metric entirely. Otherwise we get a false alarm about a bad battery, especially since there's no other metric saying whether the BBU is present or not. On the other hand, I can suppress this alarm if megaraid_healthy is 1, which it is)
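A minimal sketch of the suppression idea from Aside 2, assuming the "Status" block has already been parsed into a dictionary. The function names, the `controller` label and the `healthy_statuses` default are illustrative assumptions, not what storcli.py does today.

```python
from typing import List, Optional, Tuple


def parse_bbu_status(raw) -> Optional[int]:
    """Return the BBU status as an int, or None when storcli reports no BBU."""
    if isinstance(raw, int):
        return raw
    # Controllers without a BBU report the string "NA" instead of a number.
    return None


def battery_metric_lines(controller_index: int, status: dict,
                         healthy_statuses: Tuple[int, ...] = (0,)) -> List[str]:
    """Build textfile-collector lines; emit nothing when no BBU is present."""
    bbu_status = parse_bbu_status(status.get("BBU Status"))
    if bbu_status is None:
        return []  # no BBU: skip the metric rather than report a misleading 0
    healthy = 1 if bbu_status in healthy_statuses else 0
    return [
        f'megaraid_battery_backup_healthy{{controller="{controller_index}"}} {healthy}'
    ]


print(battery_metric_lines(0, {"BBU Status": "NA"}))
print(battery_metric_lines(0, {"BBU Status": 0}))
```

With this shape, an alert on megaraid_battery_backup_healthy == 0 can only fire on controllers that actually have a BBU.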
So in summary:
- Can anyone confirm what BBU status 32 means?
- Is it correct for storcli.py to report the battery as "healthy" in this condition, even though the overall controller health is "needs attention"?
- Should we return BBU_Info temperature as a different metric, e.g. megaraid_bbu_temperature?
- Should we suppress the megaraid_battery_backup_healthy metric if the BBU is not present (status="NA")? Or have a different metric for BBU present/absent?
Added megaraid_bbu_temperature to PR #20
Can anyone confirm what BBU status 32 means?
The BBU status indicates the type of BBU. A value of 0 is CacheVault (i.e. supercap), and a value of 32 is legacy battery-type BBU. IIRC this value is actually a bitmask.
IIRC this value is actually a bitmask.
That's definitely true; for example I can see cards with 1024 or 2048 for the status. The ones with 1024 show "Battery Replacement required : Yes" in MegaCLI.
A value of 0 is CacheVault (i.e. supercap), and a value of 32 is legacy battery-type BBU
That's interesting. However, I have access to a mix of servers built over the years, and I can give you a counter-example to that theory:
"Status" : {
"Controller Status" : "Optimal",
"Memory Correctable Errors" : 0,
"Memory Uncorrectable Errors" : 0,
"ECC Bucket Count" : 0,
"Any Offline VD Cache Preserved" : "No",
"BBU Status" : 0,
"PD Firmware Download in progress" : "No",
"Support PD Firmware Download" : "No",
"Lock Key Assigned" : "No",
"Failed to get lock key on bootup" : "No",
"Lock key has not been backed up" : "No",
"Bios was not detected during boot" : "No",
"Controller must be rebooted to complete security operation" : "No",
"A rollback operation is in progress" : "No",
"At least one PFK exists in NVRAM" : "No",
"SSC Policy is WB" : "No",
"Controller has booted into safe mode" : "No",
"Controller shutdown required" : "No"
},
...
"BBU_Info" : [
{
"Model" : "iBBU",
"State" : "Optimal",
"RetentionTime" : "48 hours +",
"Temp" : "22C",
"Mode" : "-",
"MfgDate" : "2014/03/04",
"Next Learn" : "2018/01/26 16:09:54"
}
]
Note that this one returns BBU status 0, but it has BBU_Info rather than Cachevault_Info. MegaCLI says:
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: iBBU
Voltage: 3999 mV
Current: 0 mA
Temperature: 22 C
Battery State: Optimal
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
GasGuageStatus:
Fully Discharged : No
Fully Charged : No
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 67 %
Charger System State: 49169
Charger System Ctrl: 0
Charging current: 512 mA
Absolute state of charge: 67 %
Max Error: 6 %
Exit Code: 0x00
Here is another counter-example: the status is 2048 but this is a legacy iBBU (returns BBU_Info not Cachevault_Info in storcli J). NOTE this server has two adapters, 0 and 1, and it's adapter 1 that gives the 2048 status in storcli. MegaCLI output:
BBU status for Adapter: 0
BatteryType: iBBU
Voltage: 4045 mV
Current: 0 mA
Temperature: 35 C
Battery State: Optimal
BBU Firmware Status:
<< snip >>
BBU status for Adapter: 1
BatteryType: iBBU
Battery State: Unknown
Exit Code: 0x00
EDIT: it looks like 2048 = failed communication (to adapter 1's BBU)
I can't find a counter-example in the opposite direction. All the cards which have Cachevault_Info are currently returning BBU Status 0, so none are degraded. That's happy from an operations point of view, but sad from the point of view of understanding the flags. Here's an example:
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: CVPM02
Voltage: 9474 mV
Current: 0 mA
Temperature: 31 C
Battery State: Optimal
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
BBU GasGauge Status: 0x64e2
Pack energy : 226 J
Capacitance : 100
Remaining reserve space : 0
Exit Code: 0x00
Looking again at the older cards which have BBU_Info instead of Cachevault_Info, I found one which returns BBU Status 40 (i.e. it has two bits set). MegaCLI says:
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: iBBU
Voltage: 4008 mV
Current: 0 mA
Temperature: 30 C
Battery State: Degraded(Need Attention)
A manual learn is required.
BBU Firmware Status:
Charging Status : Charging
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
GasGuageStatus:
Fully Discharged : No
Fully Charged : No
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 75 %
Charger System State: 49169
Charger System Ctrl: 0
Charging current: 512 mA
Absolute state of charge: 63 %
Max Error: 15 %
Exit Code: 0x00
Compared to the original "degraded" one I posted at the top of this thread, this has "Charging Status : Charging" instead of "Charging Status : None". So it's possible that 32 = Learn cycle requested and 8 = Charging, but that's still just a guess.
I also found one with 1056 (= 1024+32)
BBU status for Adapter: 0
BatteryType: iBBU
Voltage: 4008 mV
Current: 0 mA
Temperature: 34 C
Battery State: Degraded(Need Attention)
A manual learn is required.
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : Yes
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
GasGuageStatus:
Fully Discharged : No
Fully Charged : Yes
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 100 %
Charger System State: 49168
Charger System Ctrl: 0
Charging current: 0 mA
Absolute state of charge: 37 %
Max Error: 11 %
Exit Code: 0x00
Thinks: Battery replacement required = 1024, learn cycle requested = 32
It looks like I need to tabulate this all properly :-(
You're probably right. I remember trying to make sense of the bbu status value with @mulbc last year, and due to our limited sample size, the best guess we could come up with was that 0 = CacheVault and 32 = legacy BBU. It's quite likely however that simply due to the age of the BBUs we had available, they were in need of re-learning.
LSI / Avago / Broadcom are notoriously tight-lipped about all of this stuff. I would have dearly liked documentation on how to read CacheCade hit ratios (e.g. via some ioctl). Without the relevant documentation, the best that can be done is enabling CacheCade stats and periodically retrieving the stats as a text file with storcli (whose format is also not exactly obvious).
Here's the collected table, showing the BBU Status from storcli, and all the MegaCLI flags which were different on at least one controller.
Num | BBU Status | BatteryType | Charging Status | Learn Cycle Requested | Battery Replacement required | Battery State | Fully Charged |
---|---|---|---|---|---|---|---|
1 | 0 | iBBU08 | None | No | No | Optimal | - |
2 | 1024 | iBBU | None | No | Yes | Failed | Yes |
3 | 1056 | iBBU | None | Yes | Yes | Degraded(Need Attention) | Yes |
4 | 1024 | iBBU | None | No | Yes | Failed | Yes |
5 | 0 | iBBU | None | No | No | Optimal | Yes |
6 | 2048 | iBBU (X) | - | - | - | Unknown | - |
7 | 40 | iBBU | Charging | Yes | No | Degraded(Need Attention) | No |
8 | 32 | iBBU | None | Yes | No | Degraded(Need Attention) | No |
9 | 32 | iBBU | None | Yes | No | Degraded(Need Attention) | No |
10 | 0 | iBBU | None | No | No | Optimal | No |
11 | 0 | CVPM02 | None | No | No | Optimal | - |
12 | 0 | CVPM02 | None | No | No | Optimal | - |
13 | 0 | CVPM02 | None | No | No | Optimal | - |
14 | NA | - | - | - | - | - | - |
15 | NA | - | - | - | - | - | - |
16 | NA | - | - | - | - | - | - |
17 | 0 | CVPM02 | None | No | No | Optimal | - |
18 | 0 | CVPM02 | None | No | No | Optimal | - |
19 | 0 | CVPM02 | None | No | No | Optimal | - |
20 | 0 | CVPM02 | None | No | No | Optimal | - |
21 | 0 | CVPM02 | None | No | No | Optimal | - |
22 | 32 | iBBU | None | Yes | No | Degraded(Need Attention) | No |
23 | 32 | iBBU | None | Yes | No | Degraded(Need Attention) | Yes |
24 | NA | - | - | - | - | - | - |
25 | 0 | iBBU08 | None | No | No | Optimal | - |
26 | 0 | CVPM02 | None | No | No | Optimal | - |
27 | 0 | CVPM03 | None | No | No | Optimal | - |
28 | 0 | CVPM03 | None | No | No | Optimal | - |
(X)
BBU status for Adapter: 1
BatteryType: iBBU
Battery State: Unknown
From this I think I can say with a good degree of confidence:
- 8 = Charging
- 32 = Learn Cycle requested
- 1024 = Battery Replacement required
- 2048 = Unknown / communication with BBU failed
And that no bit in the status word gives the difference between BBU and Cachevault, nor whether it's fully charged or not.
In the examples above, all Healthy states have status 0, and all Degraded/Failed have non-zero. This might not always be true (if I am able to kick off a battery learn cycle, that might give more info).
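As a sanity check on those inferences, here is a small sketch that tests whether a given bit is set on exactly the rows where a MegaCLI flag has a given value. The rows are the distinct combinations from the table above (card 6, status 2048, is left out because MegaCLI printed no flags for it); the function name is made up for illustration.

```python
# Columns: BBU Status, Charging Status, Learn Cycle Requested,
# Battery Replacement required (one row per distinct combination in the table).
ROWS = [
    (0, "None", "No", "No"),
    (1024, "None", "No", "Yes"),
    (1056, "None", "Yes", "Yes"),
    (40, "Charging", "Yes", "No"),
    (32, "None", "Yes", "No"),
]


def bit_matches(rows, bit, column, expected):
    """True if `bit` is set on exactly the rows where `column` equals `expected`."""
    return all(bool(row[0] & bit) == (row[column] == expected) for row in rows)


print(bit_matches(ROWS, 8, 1, "Charging"))   # True
print(bit_matches(ROWS, 32, 2, "Yes"))       # True
print(bit_matches(ROWS, 1024, 3, "Yes"))     # True
```

A match only shows the mapping is consistent with this sample; it does not prove the firmware defines the bits that way.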
storcli output also includes a "State" attribute under "Cachevault_Info" or "BBU_Info". The ones I see are "Optimal", "Dgd (Needs Attention)", "Dgd" and "Failed"
After kicking off a learn cycle on cards 8 and 9 in the table above, the BBU Status on both went from 32 to 112 with MegaCLI showing:
Charging Status : Discharging
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : Yes
Learn Cycle Status : OK
Learn Cycle Timeout : No
This suggests that 16 and 64 are "Discharging" and "Learn Cycle Active" (not necessarily in that order)
On card 7 the status has gone from 40 to 114, and:
Charging Status : Discharging
Voltage : Low
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : Yes
Learn Cycle Status : OK
Learn Cycle Timeout : No
This suggests that 2 = Voltage Low
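The before/after comparison can be done mechanically. This is a sketch with a hypothetical helper that lists which single bits were set and which were cleared between two readings of the status word.

```python
def changed_bits(before: int, after: int):
    """Return (newly_set, newly_cleared) as ascending lists of single-bit values."""
    def bits(value: int):
        return [1 << i for i in range(value.bit_length()) if value & (1 << i)]

    return bits(after & ~before), bits(before & ~after)


print(changed_bits(32, 112))  # cards 8 and 9: ([16, 64], [])
print(changed_bits(40, 114))  # card 7: ([2, 16, 64], [8])
```

On card 7 the bit with value 8 is cleared at the same time, which fits the "Charging Status" moving from "Charging" to "Discharging".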
FWIW, there is some quite useful code to use as a guide in https://github.com/libstorage/libstoragemgmt/blob/master/plugin/megaraid/megaraid.py. Unfortunately I did not yet find any definitive information about the battery status value bitmap.
Summary: it appears the bits are as follows.
- 1 = ?
- 2 = Voltage Low
- 4 = ?
- 8 = Charging
- 16 = Discharging
- 32 = Learn Cycle Requested
- 64 = Learn Cycle Active
- 128 = ?
- 256 = ?
- 512 = ?
- 1024 = Battery replacement required
- 2048 = Total failure to communicate with BBU
It's possible 16 and 64 are swapped, but the positions as shown make logical sense.
I made PR #20 accept 0 and 8 as "healthy", instead of 0 and 32.
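For reference, a sketch of a decoder built from the bit list above. The mapping is the reverse-engineered guess from this thread, not vendor documentation, and the flag names and function name are invented for illustration; bits with no known meaning are reported as unknown.

```python
# Bit meanings inferred in this thread; 16 and 64 may be swapped.
BBU_STATUS_BITS = {
    2: "voltage_low",
    8: "charging",
    16: "discharging",
    32: "learn_cycle_requested",
    64: "learn_cycle_active",
    1024: "battery_replacement_required",
    2048: "bbu_communication_failed",
}


def decode_bbu_status(status: int) -> list:
    """Return the names of all bits set in `status`, lowest bit first."""
    names = []
    bit = 1
    while bit <= status:
        if status & bit:
            names.append(BBU_STATUS_BITS.get(bit, f"unknown_bit_{bit}"))
        bit <<= 1
    return names


print(decode_bbu_status(1056))
print(decode_bbu_status(114))
```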
I really appreciated this thread. I'm just thinking... wouldn't it make more sense to just export the value, instead of trying to translate it in the code? Or maybe not "instead", but "as well"?
I thought that too, although PromQL doesn't have any bitwise operators. != 0 would probably do for most use cases, but this inverts the sense of megaraid_battery_backup_healthy. Another option would be to export each bit of the status word as a separate metric with different labels, although we don't know all the bit meanings yet, and IMO that's unnecessary complexity.
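Even without bitwise operators, a single power-of-two bit can in principle be tested with plain arithmetic: divide by the bit value, take the floor, and look at the result modulo 2. The Python sketch below checks that identity against `&`. The analogous PromQL expression would be something like `floor(megaraid_bbu_status / 32) % 2 == 1`, where megaraid_bbu_status is a hypothetical raw metric and the expression has not been tested against a real server.

```python
def bit_is_set_arithmetic(value: int, bit: int) -> bool:
    """Test one power-of-two bit using only floor division and modulo."""
    return (value // bit) % 2 == 1


# Compare against the bitwise test for the status values seen in this thread.
for status in (0, 8, 32, 40, 112, 114, 1024, 1056, 2048):
    for bit in (2, 8, 16, 32, 64, 1024, 2048):
        assert bit_is_set_arithmetic(status, bit) == bool(status & bit)

print(bit_is_set_arithmetic(1056, 32))  # True
```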
Perhaps a compromise is:
- Add megaraid_bbu_status as a raw numeric metric, and let the user do what they like with it
- Keep the megaraid_battery_backup_healthy flag but take it from Cachevault_Info.State or BBU_Info.State. That is: return 1 if the state is "Optimal" and 0 for anything else.
Note: in principle you could see multiple entries under Cachevault_Info or BBU_Info because they are arrays. I've never seen it in practice, but I note that megaraid_cv_temperature has a cvidx label to distinguish them. But that's easy: only return healthy if they are all "Optimal".
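A minimal sketch of that second compromise, assuming `controller` is the parsed dictionary that holds the BBU_Info / Cachevault_Info arrays. The function name and the choice to return None when neither array is present are illustrative assumptions.

```python
from typing import Optional


def battery_backup_healthy(controller: dict) -> Optional[int]:
    """Return 1 if every BBU/CacheVault entry is "Optimal", 0 otherwise.

    Returns None when the controller lists no BBU or CacheVault at all,
    so the caller can decide to skip the metric.
    """
    entries = controller.get("BBU_Info", []) + controller.get("Cachevault_Info", [])
    if not entries:
        return None
    return 1 if all(entry.get("State") == "Optimal" for entry in entries) else 0


print(battery_backup_healthy({"BBU_Info": [{"State": "Dgd (Needs Attention)"}]}))
print(battery_backup_healthy({"Cachevault_Info": [{"State": "Optimal"}]}))
```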
Yes, I'm definitely convinced the first compromise should be in. We can leave a remark in the code about the uncertainty of the "BBU Status" value, and point it to this thread - I hope the reverse engineering doesn't stop here, though...
As for the second compromise, that's exactly what you can see in PR #32 in metric cv_state_optimal. I just missed the BBU schema and didn't want to push code without any kind of testing. If you can provide me snippets of BBU_Info, I'd be happy to add it to the code (I just realised I forgot to remove the BBU mention in the argument description), so it's kind of handy if you do :-)
And yes, I do have machines with multiple controllers, and that's the way to address it (verified in PR #32).
Actually, if you don't mind, I'd change my PR a bit this way:
- Remove any parsing from Cachevault_Info in handle_megaraid_controller; because of the detailed metrics, I'll have to iterate through /call/cv anyway, so I get that stuff from there;
- The list of new metrics from PR #32 is quite extensive now, but as I have to include a couple of them from there (cv_state/battery_healthy and temperature), I simply remove the --detailed_bbu option.
- I add the BBU_Info parsing, if you provide me a couple of schemas (the more they're different the better);
.. and you could help me test the BBU side. The reason why you might mind is that I'll be ripping off part of your changes in PR #20 (the ones that move to the dedicated parser loop)
The reason why you might mind is that I'll be ripping off part of your changes in PR #20 (the ones that move to the dedicated parser loop)
Not a problem. It would be good if we could get other PRs merged first (#22, #31) as there may be merge conflicts and rebasing required anyway; that then leaves #20 / #32 for BBU
Summary: it appears the bits are as follows.
- 1 = ?
- 2 = Voltage Low
- 4 = ?
- 8 = Charging
- 16 = Discharging
- 32 = Learn Cycle Requested
- 64 = Learn Cycle Active
- 128 = ?
- 256 = ?
- 512 = ?
- 1024 = Battery replacement required
- 2048 = Total failure to communicate with BBU
It's possible 16 and 64 are swapped, but the positions as shown make logical sense.
I made PR #20 accept 0 and 8 as "healthy", instead of 0 and 32.
16 indeed seems to be discharging status according to one of our controllers:
BBU status for Adapter: 0
BatteryType: CVPM05
Voltage: 9836 mV
Current: 0 mA
Temperature: 33 C
Battery State: Optimal
BBU Firmware Status:
Charging Status : Discharging
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
"Status" : {
"Controller Status" : "Optimal",
"Memory Correctable Errors" : 0,
"Memory Uncorrectable Errors" : 0,
"ECC Bucket Count" : 0,
"Any Offline VD Cache Preserved" : "No",
"BBU Status" : 16,
"PD Firmware Download in progress" : "No",
"Lock Key Assigned" : "No",
"Failed to get lock key on bootup" : "No",
"Lock key has not been backed up" : "No",
"Bios was not detected during boot" : "No",
"Controller must be rebooted to complete security operation" : "No",
"A rollback operation is in progress" : "No",
"At least one PFK exists in NVRAM" : "No",
"SSC Policy is WB" : "No",
"Controller has booted into safe mode" : "No",
"Controller shutdown required" : "No",
"Current Personality" : "RAID-Mode "
},
As long as the Battery state is reported as "Optimal", I don't see a reason why discharging State should be a sign of an unhealthy BBU...
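One way to express that in code is to mask out the bits considered harmless before comparing against zero. This is only a sketch: it assumes 8 and 16 really mean charging and discharging as inferred earlier in this thread, and the names are made up.

```python
# Bits treated as benign; their meanings are reverse-engineered guesses.
BENIGN_BITS = 8 | 16  # charging, discharging


def status_is_healthy(status: int) -> bool:
    """True if no bits other than the benign ones are set."""
    return (status & ~BENIGN_BITS) == 0


print(status_is_healthy(16))   # True: discharging only
print(status_is_healthy(112))  # False: learn cycle bits are also set
```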
Anyone have luck finding a list of values for BBU Status? I have a bunch of cards with values 8192, 8224 or 8256. I believe it's related to battery learn operations, but am open to advice for other places to look. Examples:
Card A
Was reporting:
storcli /cALL show all J | grep "BBU Status"
"BBU Status" : 8192,
storcli /cALL show all J | grep "BBU_Info" -A 8 | grep "Next Learn\|Optimal"
"State" : "Optimal",
"Next Learn" : "2021/03/21 21:13:21"
However, after initiating a learn command since that date is in the past:
storcli /c0/bbu start learn
Controller = 0
Status = Success
Description = None
BBU_Set_Prop :
============
------------------------
BBU-Prop Description
------------------------
Start Learn Success
The value has changed to 8256:
storcli /cALL show all J | grep "BBU Status"
"BBU Status" : 8256,
Card B
Is reporting:
storcli /cALL show all J | grep "BBU Status"
"BBU Status" : 8224,
storcli /cALL show all J | grep "BBU_Info" -A 8 | grep "Next Learn\|Optimal"
"Model" : "iBBU-09",
"Next Learn" : "2020/10/03 09:34:29"
So I gave it the ol' mode change:
storcli /c0/bbu set bbumode=3
And then tried to initiate a learn:
storcli /c0/bbu start learn
Controller = 0
Status = Failure
Description = None
Detailed Status :
===============
-----------------------------------------------------
Ctrl Status Bbu-Prop ErrMsg ErrCd
-----------------------------------------------------
0 Failed Start Learn Start bbu learn failed 50
-----------------------------------------------------
Card C
Is reporting:
storcli /cALL show all J | grep "BBU Status"
"BBU Status" : 8224,
storcli /cALL show all J | grep "BBU_Info" -A 8 | grep "Next Learn\|Optimal"
"State" : "Optimal",
"Next Learn" : "2020/07/26 02:41:15"
Since it's already in mode 3 and the next learn is way in the past, I try to issue another learn command:
storcli /c0/bbu start learn
Controller = 0
Status = Failure
Description = None
Detailed Status :
===============
-----------------------------------------------------
Ctrl Status Bbu-Prop ErrMsg ErrCd
-----------------------------------------------------
0 Failed Start Learn Start bbu learn failed 50
-----------------------------------------------------
Side question though, anyone ever deal with learn errcd 50? :D
I have a bunch of cards with values 8192, 8224 or 8256
8224 = 8192 + 32, 8256 = 8192 + 64. Flags 32 and 64 are known (Learn Cycle Requested, Learn Cycle Active). So it's just a question of what 8192 means.
Can you install the old MegaCli64 and show the full output of /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL? Perhaps also the "BBU_Info" or "Cachevault_Info" sections from storcli? Maybe by comparing those with the previously-posted examples you'll be able to identify a flag which 8192 matches.
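The split into known and unknown bits can be sketched as below. KNOWN_BITS collects the values guessed so far in this thread, and the function name is hypothetical.

```python
# Bits with a guessed meaning so far: 2, 8, 16, 32, 64, 1024, 2048.
KNOWN_BITS = 2 | 8 | 16 | 32 | 64 | 1024 | 2048


def split_known_unknown(status: int):
    """Return (known_part, unknown_part) of a BBU status word."""
    return status & KNOWN_BITS, status & ~KNOWN_BITS


for value in (8192, 8224, 8256):
    print(value, split_known_unknown(value))
# 8192 (0, 8192)
# 8224 (32, 8192)
# 8256 (64, 8192)
```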
So bad news! These three cards specifically segfault when running MegaCli64. They all have the same output, except voltage which seems in an acceptable range, the highest being 4049 mV. Card D in this case is a card that reports a 0 in BBU Status.
Card A,B,C
# MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: iBBU-09
Voltage: 3767 mV
Current: 0 mA
Temperature: 17 C
Battery State: Optimal
Segmentation fault (core dumped)
Card D
# MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: CVPM03
Voltage: 9503 mV
Current: 0 mA
Temperature: 27 C
Battery State: Optimal
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
BBU GasGauge Status: 0x62ea
Pack energy : 234 J
Capacitance : 98
Remaining reserve space : 0
Exit Code: 0x00
Segfault
I threw the MegaCli64 command behind strace and this is where it segfaults:
open("MegaSAS.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 4
lseek(4, 0, SEEK_END) = 68536
fstat(4, {st_mode=S_IFREG|0644, st_size=68536, ...}) = 0
write(4, "Voltage: 4048 mV\nCurrent: 0 mA\nT"..., 49) = 49
close(4) = 0
write(1, "Battery State: Optimal\n", 23Battery State: Optimal
) = 23
open("MegaSAS.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 4
lseek(4, 0, SEEK_END) = 68585
fstat(4, {st_mode=S_IFREG|0644, st_size=68585, ...}) = 0
write(4, "Battery State: Optimal\n", 23) = 23
close(4) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x4d, 0x01, 0x194), 0x20cf1d0) = 0
ioctl(3, _IOC(_IOC_READ|_IOC_WRITE, 0x4d, 0x01, 0x194), 0x20cf1b0) = 0
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0} ---
+++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
The MegaSAS.log file contains the same output as the command, minus the Segfault:
# cat /root/MegaSAS.log
BBU status for Adapter: 0
BatteryType: iBBU-09
Voltage: 4049 mV
Current: 0 mA
Temperature: 17 C
Battery State: Optimal
I'm honestly not sure what to do next.
Ah neat, so now we know 8192 means "crash MegaCli" :-)
What about the "BBU_Info" or "Cachevault_Info" sections from storcli? Is there anything obvious there?
Otherwise, I haven't seen 4096 and 8192, and these might be flags specific to a new model of card or BBU type, which MegaCli doesn't understand.
Your card A,B,C have "iBBU-09" which doesn't match any of mine. Your card D has "CVPM03" (which I have seen - cards 27 and 28 in the table I posted)
Nice catch on the CVPM03, I unluckily chose the one instance that is explicitly different. I've included a functioning identical node's output just to keep it sane:
CARD E:
# MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
BBU status for Adapter: 0
BatteryType: iBBU-09
Voltage: 3846 mV
Current: 0 mA
Temperature: 23 C
Battery State: Optimal
Design Mode : 24+ Hrs retention with a transparent learn cycle and balanced service life.
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : No
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : No
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
BBU GasGauge Status: 0x0180
Relative State of Charge: 64 %
Charger System State: 1
Charger System Ctrl: 0
Charging current: 0 mA
Absolute state of charge: 49 %
Max Error: 0 %
Exit Code: 0x00
As far as BBU_Info goes, there's nothing obvious, except that the next learn is in the past for 2 out of 3. In my first update to this issue (https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/27#issuecomment-804982414) I mentioned that initiating a learn on any of cards A, B, C throws this error, which I still cannot track down.
-----------------------------------------------------
Ctrl Status Bbu-Prop ErrMsg ErrCd
-----------------------------------------------------
0 Failed Start Learn Start bbu learn failed 50
-----------------------------------------------------