node-exporter-textfile-collector-scripts
ADD storcli detailed CacheVault metrics
There doesn't seem to be any documentation for the "BBU Status" field and, unlike what @mulbc describes in 1b98db9fa72abe93541fb1a7140388504601e303, we now see it in several states other than [0,32] (16, 64, etc.) which we cannot interpret, and each state lasts for an unpredictable number of days.
I therefore followed his suggestion of parsing the detailed output, but only for CacheVault, not BBU. I cannot currently find a machine with a BBU, so that will come later.
See https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/pull/20
Hi @dswarbrick, thanks for sharing! The referenced issues are very enlightening with regard to reverse-engineering the BBU Status :-) Indeed, I couldn't get feedback from the vendor either, but there seems to be some progress for us!
In any case, this PR adds detailed CV metrics, so we're changing our alerts anyway for the time being. Maybe you want to add the /call/bbu ? :-)
Refactored a bit based on the discussion in #27:
- cv_temperature and battery_backup_healthy now come from the dedicated CV/BBU parser loops, instead of the "/call show all" (handle_megaraid_controller) parser;
- Note that the cvidx parameter is removed: the output on a multiple controller box is NOT a list, which seems to indicate the CV (at least) is indexed to the controller (IOW, one controller can only have one CacheVault);
- I've "introduced" megaraid_bbu_status as the plain value from "BBU Status", with the remark that its interpretation is not yet clear. There's a nice reverse-engineering effort going on here: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/27#issuecomment-567884906 , but at least we can start graphing the value and correlating it with other metrics.
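To illustrate the refactoring described in the list above, here is a minimal sketch of a per-controller CV parser that emits the temperature and the raw "BBU Status" value. It is not the code from this PR: the `cv_data` dictionary, its keys (`temperature_c`, `bbu_status`), the `controller` label and the metric name `megaraid_cv_temperature` are assumptions made for illustration, and no real storcli output is parsed here.

```python
def format_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(
        '{}="{}"'.format(key, val) for key, val in sorted(labels.items())
    )
    return "{}{{{}}} {}".format(name, label_str, value)


def cv_metrics(controller_index, cv_data):
    """Build metric lines for the single CacheVault of one controller.

    cv_data is a hypothetical, already-parsed dict; there is no cvidx
    because one controller is assumed to have at most one CacheVault.
    """
    labels = {"controller": str(controller_index)}
    lines = []
    if "temperature_c" in cv_data:
        lines.append(
            format_metric("megaraid_cv_temperature", labels, cv_data["temperature_c"])
        )
    if "bbu_status" in cv_data:
        # Exposed as the plain value; its interpretation is not yet clear.
        lines.append(
            format_metric("megaraid_bbu_status", labels, cv_data["bbu_status"])
        )
    return lines


if __name__ == "__main__":
    for line in cv_metrics(0, {"temperature_c": 28, "bbu_status": 32}):
        print(line)
```

Keeping the raw status as a plain gauge means it can be graphed next to the other metrics without committing to any interpretation of the value.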
@candlerb In the meantime I (finally) found a server with a BBU, so I've added the "detailed" metrics as well. For the BBU, there's a lot more potentially interesting stuff under 'BBU_Firmware_Status', but for now I think it's good.
My general observation about this PR is there seems to be a lot of information returned about BBU here and I'm not sure it's all necessary. It means we probably know more about the BBU than the adapter itself! To put it another way: does everything which appears in the JSON have to be exposed as a metric?
If the state is non-optimal then the user can investigate directly on the card. e.g. do we really need a separate metric for "microcode update required"? That state is probably one of the flags in bbu_status anyway, so users who come across this condition will learn what that status value means.
Alternatively: these flags could go as labels into cv_info, in the same way as pd_info has state="Onln" or state="GHS" as a label, rather than as a separate metric per flag.
OTOH: I agree it is consistent to have cv_info / bbu_info with serial number etc, in the same way as the adapter info.
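The two options discussed above (flags as labels on an info metric versus one metric per flag) can be compared with a small sketch. The flag names, the `controller` label and the metric names below are hypothetical and chosen only for illustration; they are not taken from the script.

```python
def render(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(
        '{}="{}"'.format(key, val) for key, val in sorted(labels.items())
    )
    return "{}{{{}}} {}".format(name, label_str, value)


def flags_as_labels(controller, flags):
    """Option A: a single info-style metric, every flag becomes a label."""
    labels = {"controller": controller}
    labels.update({key: ("Yes" if val else "No") for key, val in flags.items()})
    return [render("megaraid_cv_info", labels, 1)]


def flags_as_metrics(controller, flags):
    """Option B: one 0/1 gauge per flag."""
    return [
        render("megaraid_cv_" + key, {"controller": controller}, int(val))
        for key, val in sorted(flags.items())
    ]


if __name__ == "__main__":
    # Hypothetical flags for one CacheVault.
    flags = {"microcode_update_required": False, "replacement_required": True}
    print("\n".join(flags_as_labels("0", flags)))
    print("\n".join(flags_as_metrics("0", flags)))
```

With option A, a flag change alters the label set and therefore starts a new time series, whereas option B keeps one stable series per flag whose value flips between 0 and 1, which tends to be easier to alert on.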
Hi @candlerb, thanks for taking the time to go through this.
> It means we probably know more about the BBU than the adapter itself!
I don't see a problem with that :-) I got curious about the ability to track battery capacitance over its lifetime; it could help predict replacements.
> That state is probably one of the flags in bbu_status anyway
That might be very true; however, until "BBU Status" is completely reverse-engineered, no one can tell for sure. Look, I don't even really know what they mean, I just sensed they could point to a problem somehow, and if that's the case, I'd rather have Prometheus tell me immediately than have to issue a storcli64 command and dig it out of a huge listing of parameters.
> Alternatively: these flags could go as labels into cv_info, in the same way as pd_info has state="Onln" or state="GHS" as a label, rather than as a separate metric per flag.
Well, considering what I know about those metrics, I'd actually have done that if I had thought of it myself. But we're talking about saving 2 metrics out of the whole bunch, and we'd go back to having to run storcli64 yourself to figure it out, as I wrote above. I suggest we stick with independent metrics.
PS: I also wanted to say that I extended this a bit with the "BBU Status" reverse engineering in mind, that is: maybe we graph the whole thing, and reverse-engineer "BBU Status" just by looking at the graphs :-)
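As a small aid for that kind of reverse engineering, the sketch below splits a status value into the positions of its set bits, on the assumption (not confirmed by any vendor documentation) that "BBU Status" is a bit field. The helper name `set_bits` is hypothetical, and what each bit means remains unknown.

```python
def set_bits(status):
    """Return the positions of the bits set in a non-negative integer."""
    return [bit for bit in range(status.bit_length()) if (status >> bit) & 1]


if __name__ == "__main__":
    # Values mentioned in this thread; the meaning of each bit is unknown.
    for value in (0, 16, 32, 64):
        print(value, set_bits(value))
```

Each of the observed values 16, 32 and 64 has exactly one bit set (bits 4, 5 and 6 respectively), which at least fits the idea of independent flags that can be correlated with the other metrics over time.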
@ntavares Please rebase on master and resolve any merge conflicts.