Issues with go-native IPMI collector
I just tested the new go-native IPMI collectors on 2 machines that I have running at home and found some issues. I did my tests using the binaries downloaded from the 1.10.0 release page, running on Debian Bookworm. The exporter was running directly on the servers using the "local" connection. The configuration file had the following contents:
modules:
default:
collectors:
- bmc
- ipmi
- chassis
- sel
The biggest issue was that on one of the machines the BMC collector consistently crashed (SIGSEGV) when being scraped. The log output from the crash can be found here, together with some info about the hardware and firmware versions. Once I deactivated the BMC collector via the configuration I was able to scrape metrics from the exporter. However, with every scrape I got the following log messages (4 entries per scrape):
level=ERROR source=collector_ipmi_native.go:178 msg="Unknown sensor state" target=[local] state=0x0080
I tried disabling some sensors (using exclude_sensor_ids: in the config) but wasn't able to figure out which sensor IDs actually triggered the issue. It would be great if the log line contained the sensor ID in question. Also, I found it interesting that the sensor IDs reported by the native implementation differ quite a bit from the IDs reported by FreeIPMI.
On the second host the BMC collector did work. All the info about this host can be found here. Every scrape on this host produced a single log line like this:
level=ERROR source=collector_ipmi_native.go:178 msg="Unknown sensor state" target=[local] state=0x04ff
Again, I was not able to find the sensor ID that actually triggers this.
If there is any other test I should do, let me know. At my day job I have access to several other kinds of servers with IPMI that I could test; those are mostly Dell R430, R440, and R450, but also some HP servers. I won't be able to get to it for a while, though, because I have a very busy month ahead of me.
Hi there,
thanks a lot for testing the new native collectors, it is much appreciated. I have to admit that I rarely use local mode, so the native implementation is not well-tested with it at all. I am also quite time-constrained at the moment, but I'll try to dig into this soon. One question, though:
I just tested the new go-native IPMI collectors on 2 machines that I have running at home
What type of hardware are those? Also Dell?
I tried disabling some sensors (using exclude_sensor_ids: in the config) but wasn't able to figure out which sensor IDs actually triggered the issue. It would be great if the log line contained the sensor ID in question.
Indeed, will look into it.
Also, I found it interesting that the sensor IDs reported by the native implementation differ quite a bit from the IDs reported by FreeIPMI.
If you really care to know, this is because in IPMI sensors only have "numbers", and they are not guaranteed to be unique, so both go-ipmi and FreeIPMI "make up" the ID, but they do it differently. Annoying, but in the end it shouldn't matter (assuming of course the error message would tell you the ID so you can add it to exclude 😉 )
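For illustration only: an IPMI sensor number is a single byte that is only unique per sensor owner and LUN, so a stable ID has to be synthesized from several SDR fields. Neither go-ipmi nor FreeIPMI necessarily does it the way this sketch does, which is exactly why the two sets of IDs end up different:

import "fmt"

// Sketch only: compose a stable key from the SDR fields that together
// identify a sensor. The real implementations derive their IDs differently.
func sensorKey(ownerID, ownerLUN, sensorNumber uint8) string {
	return fmt.Sprintf("%d.%d.%d", ownerID, ownerLUN, sensorNumber)
}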
Hi there,
thanks a lot for testing the new native collectors, it is much appreciated. I have to admit that I rarely use local mode, so the native implementation is not well-tested with it at all.
Yeah, I know it's quite strange to monitor the IPMI this way, given that any time the server's OS is dead or unreachable you lose its metrics. However, for quite a lot of the deployments I have to work with, I have very limited access to the VLAN the IPMI is connected to. A lot of the time I can only reach the web UI of the BMC via a combination of VPNs and some kind of remote desktop environment. This means I sadly cannot integrate the IPMI into our monitoring infrastructure using the LAN protocol. To be honest, the only reason I run the ipmi_exporter is to get informed if one of the power supplies has lost power or a fan fails. Most of my server deployments are on the edge, where losing one power rail is quite common, and for this use case running the exporter on the machine itself is imho still a good idea.
I am also quite time-constrained at the moment, but I'll try to dig into this soon. One question, though:
I just tested the new go-native IPMI collectors on 2 machines that I have running at home
What type of hardware are those? Also Dell?
No, the one where the BMC collector crashes is an ASRockRack X570D4U, which uses an Aspeed AST2500. Judging by the looks of the web interface, the firmware running on the BMC probably comes from Aspeed. I know of at least one other mainboard by a different vendor (Gigabyte) that uses the same chip, and the web UI of that board looks very similar to the one of the ASRockRack board.
The second server I tested is from Supermicro and uses the same AST2500 chip, but at least the web UI of this board looks quite different. So I'm guessing the firmware running on that BMC has probably been adapted much more by Supermicro.
In the next few days I will have to install 3 other servers for a project I'm working on. One of these servers has an ASRockRack X470D4U, the previous generation of the one I had trouble with. The other two servers are identical, based on a Supermicro board that is quite old at this point (I don't know the exact name right now). I won't be able to do a lot of testing, but running the exporter with the new native collectors at least for a while should be possible. I will report here what I find out.
I tried disabling some sensors (using exclude_sensor_ids: in the config) but wasn't able to figure out which sensor IDs actually triggered the issue. It would be great if the log line contained the sensor ID in question.
Indeed, will look into it.
Great, thanks!
Also, I found it interesting that the sensor IDs reported by the native implementation differ quite a bit from the IDs reported by FreeIPMI.
If you really care to know, this is because in IPMI sensors only have "numbers", and they are not guaranteed to be unique, so both go-ipmi and FreeIPMI "make up" the ID, but they do it differently. Annoying, but in the end it shouldn't matter (assuming of course the error message would tell you the ID so you can add it to exclude 😉 )
Hmmm, ok, I already figured it must be something like this. I agree it shouldn't matter too much: as long as the numbers "made up" by the native collectors are consistent across reboots (and hopefully re-installation of the server?), it shouldn't be a big deal. In that case I only have to edit my list of excluded IDs once, when I change from FreeIPMI to the new collectors.
I tried to test it; at this point I can say that the native exporter practically does not work. For example, the first scrape returns 523 lines, and every subsequent scrape reports that all collectors are down:
[root@prom]# curl -Ss "http://127.0.0.1:9494/ipmi?module=dell&target=172.16.18.9" | wc -l
523
[root@prom]# curl -Ss "http://127.0.0.1:9494/ipmi?module=dell&target=172.16.18.9" | wc -l
9
[root@prom]# curl -Ss "http://127.0.0.1:9494/ipmi?module=dell&target=172.16.18.9" | wc -l
9
[root@prom]# curl -Ss "http://127.0.0.1:9494/ipmi?module=dell&target=172.16.18.9" | wc -l
9
Errors on the exporter side (Dell):
time=2025-03-14T11:19:43.530Z level=ERROR source=collector.go:168 msg="Error connecting to IPMI device" target=172.16.18.9 error="connect20 failed after try all cipher suite ids ([17 3]), errs: \ncmd: RMCP+ Open Session failed with cipher suite id (17), err: rakp status code error: (0x01) Insufficient resources to create a session\ncmd: RMCP+ Open Session failed with cipher suite id (3), err: rakp status code error: (0x01) Insufficient resources to create a session"
time=2025-03-14T11:19:43.530Z level=ERROR source=collector.go:120 msg="Collector failed" name=bmc error="connect20 failed after try all cipher suite ids ([17 3]), errs: \ncmd: RMCP+ Open Session failed with cipher suite id (17), err: rakp status code error: (0x01) Insufficient resources to create a session\ncmd: RMCP+ Open Session failed with cipher suite id (3), err: rakp status code error: (0x01) Insufficient resources to create a session"
Or (Supermicro):
time=2025-03-14T11:42:50.958Z level=ERROR source=collector.go:168 msg="Error connecting to IPMI device" target=172.16.18.8 error="connect20 failed after try all cipher suite ids ([3 6 7 8 11 12]), errs: \ncmd: RMCP+ Open Session failed with cipher suite id (3), err: client exchange failed, err: unpack session setup response failed, err: unpacked data is too short (7/8)\ncmd: RMCP+ Open Session failed with cipher suite id (6), err: client exchange failed, err: unpack session setup response failed, err: unpacked data is too short (7/8)\ncmd: RMCP+ Open Session failed with cipher suite id (7), err: client exchange failed, err: unpack session setup response failed, err: unpacked data is too short (7/8)\ncmd: RMCP+ Open Session failed with cipher suite id (8), err: client exchange failed, err: unpack session setup response failed, err: unpacked data is too short (7/8)\ncmd: RMCP+ Open Session failed with cipher suite id (11), err: client exchange failed, err: unpack session setup response failed, err: unpacked data is too short (7/8)\ncmd: RMCP+ Open Session failed with cipher suite id (12), err: client exchange failed, err: unpack session setup response failed, err: unpacked data is too short (7/8)"
After these messages, the production ipmi_exporter can't collect from the BMC for ~5 minutes, with this error 🥲:
Mar 14 18:48:01 prometheus.example.com ipmi_exporter[3979970]: ts=2025-03-14T11:48:01.252Z caller=collector_bmc.go:53 level=error msg="Failed to collect BMC data" target=172.16.18.8 error="error running bmc-info: exit status 1: ipmi_ctx_open_outofband_2_0: BMC busy\n"
Also, the native exporter returns many "Unknown sensor state" errors. Debug attached.
Hi! I have also encountered this issue. During this time period, tools like ipmimonitoring and ipmitool cannot be used.
ipmitool reports an error: Error in open session response message : insufficient resources for session. Error: Unable to establish IPMI v2 / RMCP+ session.
ipmimonitoring reports an error: BMC Busy.
The error persists for more than 10 minutes without recovery. The BMC seems to have completely crashed 🥲.
Is there any new progress on this?
When I use ipmitool -H <targetIP> -U <user> -P <password> -Ilanplus session info active to retrieve active session information, I get:
session handle : 121
slot count : 36
active sessions : 33
user id : 2
privilege level : ADMINISTRATOR
session type : IPMIv2/RMCP+
channel number : 0x01
console ip : 180.76.30.98
console mac : 00:00:89:89:00:01
console port : 21391
When I scrape metrics the next time, ipmi_exporter errors, and the number of active sessions keeps increasing.
Is the native_client not closing the session after the request ends?
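If it isn't, a per-scrape connect/close pattern along these lines would keep the BMC's session slots from being exhausted. This is only a sketch against go-ipmi; the exact signatures (for example whether Connect and Close take a context) vary between versions:

package main

import (
	"context"
	"fmt"

	"github.com/bougou/go-ipmi"
)

// scrapeOnce opens one RMCP+ session, runs the collectors, and always closes
// the session again so the BMC's limited session slots are not leaked.
// Sketch only; signatures follow recent go-ipmi releases and may differ.
func scrapeOnce(ctx context.Context, host, user, pass string, collect func(*ipmi.Client) error) error {
	client, err := ipmi.NewClient(host, 623, user, pass)
	if err != nil {
		return fmt.Errorf("create client: %w", err)
	}
	if err := client.Connect(ctx); err != nil {
		return fmt.Errorf("connect: %w", err)
	}
	// Release the session even if a collector fails mid-scrape.
	defer client.Close(ctx)
	return collect(client)
}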
Trying to run 1.10.1 as a DaemonSet on my k8s cluster to query the local IPMI.
Config:
modules:
default:
driver: local
privilege: user
timeout: 10000
Logs:
time=2025-07-31T12:14:34.184Z level=INFO source=main.go:107 msg="Starting ipmi_exporter" version="(version=1.10.1, branch=HEAD, revision=291d0107bdab44df5a7ff50ce552ee3ccc23b52a)"
time=2025-07-31T12:14:34.184Z level=INFO source=main.go:109 msg="Using Go-native IPMI implementation - this is currently EXPERIMENTAL"
time=2025-07-31T12:14:34.184Z level=INFO source=main.go:110 msg="Make sure to read https://github.com/prometheus-community/ipmi_exporter/blob/master/docs/native.md"
time=2025-07-31T12:14:34.184Z level=INFO source=config.go:281 msg="Loaded config file" path=/config.yml
time=2025-07-31T12:14:34.185Z level=INFO source=tls_config.go:347 msg="Listening on" address=[::]:9290
time=2025-07-31T12:14:34.185Z level=INFO source=tls_config.go:350 msg="TLS is disabled." http2=false address=[::]:9290
time=2025-07-31T12:14:36.726Z level=ERROR source=collector_ipmi_native.go:181 msg="Unknown sensor state" target=[local] state=0x0180 sensor_id=174
time=2025-07-31T12:14:36.727Z level=ERROR source=collector_ipmi_native.go:181 msg="Unknown sensor state" target=[local] state=0x0180 sensor_id=172
time=2025-07-31T12:14:36.727Z level=ERROR source=collector_ipmi_native.go:181 msg="Unknown sensor state" target=[local] state=0x4080 sensor_id=237
time=2025-07-31T12:14:36.727Z level=ERROR source=collector_ipmi_native.go:181 msg="Unknown sensor state" target=[local] state=0xc000 sensor_id=2
time=2025-07-31T12:14:36.727Z level=ERROR source=collector_ipmi_native.go:181 msg="Unknown sensor state" target=[local] state=0xc000 sensor_id=3
time=2025-07-31T12:14:36.727Z level=ERROR source=collector_ipmi_native.go:181 msg="Unknown sensor state" target=[local] state=0x8080 sensor_id=228
time=2025-07-31T12:14:36.727Z level=ERROR source=collector_ipmi_native.go:181 msg="Unknown sensor state" target=[local] state=0x8080 sensor_id=227
panic: runtime error: slice bounds out of range [:17] with capacity 16
goroutine 65 [running]:
github.com/bougou/go-ipmi.getSystemInfoStringMeta({0xc00038a0d0?, 0x7f09c48880a8?, 0x10?})
/go/pkg/mod/github.com/bougou/[email protected]/cmd_get_system_info_params.go:387 +0x4b3
github.com/bougou/go-ipmi.(*SystemInfoParams).ToSystemInfo(0xc000493510)
/go/pkg/mod/github.com/bougou/[email protected]/types_system_info_params.go:104 +0xc5
main.BMCNativeCollector.Collect({}, {{0x0, 0x0, 0x0}, {_, _}}, _, {{0x0, 0x0}, {{0x0, ...}, ...}})
/app/collector_bmc_native.go:76 +0x3c5
main.ConfiguredCollector.Collect(...)
/app/config.go:69
main.metaCollector.Collect({{0x0, 0x0}, {0xa9374a, 0x7}, 0xf7f120}, 0xc000316a80)
/app/collector.go:118 +0x5b6
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:458 +0xe5
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather in goroutine 29
/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:548 +0xbab
With the non-native IPMI implementation and driver=OPENIPMI it works.
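For what it's worth, the panic message (slice bounds out of range [:17] with capacity 16) suggests a length declared by the BMC being used to slice a fixed 16-byte response chunk without clamping. Purely as an illustration of that failure class, not the actual go-ipmi code:

// Illustration only: slicing a 16-byte chunk with an unvalidated length byte
// reported by the BMC panics exactly like the trace above; clamping avoids it.
func stringFromChunk(chunk []byte, declaredLen int) string {
	if declaredLen > len(chunk) {
		declaredLen = len(chunk)
	}
	return string(chunk[:declaredLen])
}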
Hi, similar to the above, I see this error in the latest version, also using the local native exporter.
level=ERROR source=collector_ipmi_native.go:181 msg="Unknown sensor state" target=[local] state=0x4080 sensor_id=214
except that, once I started excluding sensors by the reported sensor_id, it appears to in fact be nearly every single sensor, including ones I am otherwise successfully scraping.
Looking inside go-ipmi, I see the implementation of Status() is as follows:
func (sensor *Sensor) Status() string {
	if sensor.notPresent {
		return "N/A"
	}
	if sensor.scanningDisabled {
		return "N/A"
	}
	if !sensor.IsReadingValid() {
		return "N/A"
	}
	if sensor.IsThreshold() {
		return string(sensor.Threshold.ThresholdStatus)
	}
	return fmt.Sprintf("0x%02x%02x", sensor.Discrete.optionalData1, sensor.Discrete.optionalData2)
}
As such, it looks like this 0xXXXX output is going to be a pretty ordinary thing whenever a sensor is present, has scanning enabled, has a valid reading, and is not a threshold-type sensor. My experience with IPMI is not significant, but my inclination is that the log warning here:
logger.Error(
	"Unknown sensor state",
	"target", targetHost,
	"state", data.Status(),
	"sensor_id", strconv.FormatInt(int64(data.Number), 10),
)
should simply be removed.
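If some signal is still wanted, the collector could instead treat anything that is not "N/A" and not a threshold status as an ordinary discrete reading and at most log it at debug level. A rough sketch reusing the names from the snippet above; none of this is the actual exporter code:

// Sketch only: discrete readings are expected data, not an error condition.
status := data.Status()
switch {
case status == "N/A":
	// Not present, scanning disabled, or reading invalid: nothing to report.
case data.IsThreshold():
	// Threshold sensors: Status() is a threshold state, handled as before.
default:
	// Discrete sensors always end up here with raw event bytes like 0x4080,
	// so at most note them at debug level instead of erroring on every scrape.
	logger.Debug("Discrete sensor state", "target", targetHost,
		"state", status, "sensor_id", strconv.FormatInt(int64(data.Number), 10))
}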