snmp_exporter
snmp_exporter copied to clipboard
Walking PDU table breaks some values on HPE PDU
Host operating system: output of uname -a
Linux prometheus 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2 (2020-04-29) x86_64 GNU/Linux
snmp_exporter version: output of snmp_exporter -version
Make failed due to some dependencies, but I'm running at commit 1b3b0539ec70d692ec1d405345e685a6368c55d1
What device/snmpwalk OID are you using?
HPE PDU AF400A (1.00.A, firmware version 1.00.14).
OID 1.3.6.1.4.1.232.165.2.3.2.1.4 / breakerPercentLoad.
Failing config (the most minimal I could make):
modules:
hpe-pdumm:
walk:
- pdu
Good reference config (the most minimal I could make):
modules:
hpe-pdumm:
walk:
- breakerPercentLoad
If this is a new device, please link to the MIB(s).
CPQPOWER-MIB.txt, e.g. from https://github.com/kamelnetworks/hp-pdu-prometheus
What did you do that produced an error?
$ curl -q 'localhost:9116/snmp?target=172.18.0.10&module=hpe-pdumm' | grep breakerPercentLoad
What did you expect to see?
# HELP breakerPercentLoad The breaker load in percent. - 1.3.6.1.4.1.232.165.2.3.2.1.4
# TYPE breakerPercentLoad gauge
breakerPercentLoad{breakerIndex="1",pduOutputIndex="1"} 6
breakerPercentLoad{breakerIndex="1",pduOutputIndex="2"} 6
breakerPercentLoad{breakerIndex="2",pduOutputIndex="1"} 0
breakerPercentLoad{breakerIndex="2",pduOutputIndex="2"} 5
breakerPercentLoad{breakerIndex="3",pduOutputIndex="1"} 8
breakerPercentLoad{breakerIndex="3",pduOutputIndex="2"} 8
[breakerIndex=>3 removed, they are just zero]
What did you see instead?
# HELP breakerPercentLoad The breaker load in percent. - 1.3.6.1.4.1.232.165.2.3.2.1.4
# TYPE breakerPercentLoad gauge
breakerPercentLoad{breakerIndex="1",pduOutputIndex="1"} 6
breakerPercentLoad{breakerIndex="1",pduOutputIndex="2"} 0
breakerPercentLoad{breakerIndex="2",pduOutputIndex="1"} 0
breakerPercentLoad{breakerIndex="2",pduOutputIndex="2"} 0
breakerPercentLoad{breakerIndex="3",pduOutputIndex="1"} 8
breakerPercentLoad{breakerIndex="3",pduOutputIndex="2"} 0
[breakerIndex=>3 removed, they are just zero]
The error is that pduOutputIndex="2" shows 0 on load.
Note that snmpwalk and snmpbulkwalk shows no issues, the values are fine regardless if I walk breakerPercentLoad or pdu.
My current workaround is to list them one-by-one: https://gist.github.com/bluecmd/f2c52dabaf5fbaac83a45291c2d6b7cc That seems to work for now until the issue is resolved.
What does snmpbulkwalk with a Cr of 25 show? This is probably a device bug.
Thanks, yes - quite clearly it is a device bug.
I have played around with various parameters and bulk walk seems to be quite shaky on this device.
$ snmpbulkwalk -Cr25 -v 2c -c [xx] 172.18.0.10 -m ALL .1 # Correct
$ snmpbulkwalk -Cr25 -v 2c -c [xx] 172.18.0.10 -m ALL pdu # Hangs and times out after CPQPOWER-MIB::breakerPercentLoad.1.4
$ snmpbulkwalk -Cr10 -v 2c -c [xx] 172.18.0.10 -m ALL pdu # Hangs and times out at breakerStatus
$ snmpbulkwalk -Cr5 -v 2c -c [xx] 172.18.0.10 -m ALL pdu # Hangs
$ snmpbulkwalk -Cr4 -v 2c -c [xx] 172.18.0.10 -m ALL pdu # Correct
$ snmpbulkwalk -Cr25 -v 2c -c [xx] 172.18.0.10 -m ALL breakerPercentLoad # Works
I was only looking at breakerPercentLoad before, so I missed the timeout in breakerStatus in the previous gist.
-Cr4 seems to be the highest I can poll pdu without timeout. I set that in my config and it seems to work from what I can tell.
From the issue PoV, I think this can be closed.
Feature request
While debugging this I noticed that snmp_exporter seems to not really surface these timeouts when the device is not behaving - is that working as intended? The log seems to be quite empty, and the only metric I could maybe use I think is snmp_scrape_pdus_returned - but without knowing the golden value it is not really usable to detect an issue I feel.
The log with --log.level=debug shows only this:
level=debug ts=2020-06-27T19:56:14.793Z caller=main.go:99 module=hpe-pdumm target=172.18.0.10 msg="Starting scrape"
level=debug ts=2020-06-27T19:56:14.793Z caller=collector.go:164 module=hpe-pdumm target=172.18.0.10 msg="Walking subtree" oid=1.3.6.1.4.1.232.165.2
level=debug ts=2020-06-27T19:56:15.635Z caller=collector.go:177 module=hpe-pdumm target=172.18.0.10 msg="Walk of subtree completed" oid=1.3.6.1.4.1.232.165.2 duration_seconds=841.845929ms
level=debug ts=2020-06-27T19:56:15.637Z caller=main.go:110 module=hpe-pdumm target=172.18.0.10 msg="Finished scrape" duration_seconds=0.844159811
while snmpbulkwalk -Cr25 clearly says:
CPQPOWER-MIB::breakerCurrent.2.5 = INTEGER: 0
CPQPOWER-MIB::breakerCurrent.2.6 = INTEGER: 0
CPQPOWER-MIB::breakerPercentLoad.1.1 = INTEGER: 6
CPQPOWER-MIB::breakerPercentLoad.1.2 = INTEGER: 0
CPQPOWER-MIB::breakerPercentLoad.1.3 = INTEGER: 8
CPQPOWER-MIB::breakerPercentLoad.1.4 = INTEGER: 0
Timeout: No Response from 172.18.0.10
So I guess in summary:
- Why does
snmp_exportertreat this condition as 0 value? - Why doesn't
snmp_exporterexpose some "something happened" log or metric?
While debugging this I noticed that snmp_exporter seems to not really surface these timeouts when the device is not behaving - is that working as intended?
That doesn't sound right. If there's a timeout that should result in a failed scrape. I'd suggest checking what the device is returning via tcpdump to see exactly where the issue lies.
Looking and comparing the pcaps from snmpbulkwalk and snmp_exporter they look the same up until the device sends what Wireshark considers to be an invalid SNMP packet. The failing OID is CPQPOWER-MIB::breakerPercentLoad.1.5 or .1.3.6.1.4.1.232.165.2.3.2.1.4.1.5 and that is with max-repetitions 25 apparently on the border between requests.
Simple Network Management Protocol
version: v2c (1)
community: [xxx]
data: get-response (2)
get-response
request-id: -1851290032
error-status: noError (0)
error-index: 24
variable-bindings: 25 items
1.3.6.1.4.1.232.165.2.3.2.1.4.1.5:
Object Name: 1.3.6.1.4.1.232.165.2.3.2.1.4.1.5 (iso.3.6.1.4.1.232.165.2.3.2.1.4.1.5)
Integral value is zero-length
[Expert Info (Note/Undecoded): Integral value is zero-length]
[Integral value is zero-length]
[Severity level: Note]
[Group: Undecoded]
Comparing the behavior on this packet with snmpbulkwalk it seems like snmp_exporter claims to be able to process this packet and does not error out, while snmpbulkwalk sees it as an issue and retries the fetch (which returns the same broken response) and eventually reports this as a timeout.
Here is the relevant output (IMO) of snmpbulkwalk -DALL:
trace: snmp_parse_var_op(): snmp.c, 164:
dumph_recv: Name
dumpx_recv: 06 10 2B 06 01 04 01 81 68 81 25 02 03 02 01 04
01 05
dumpv_recv: ObjID: CPQPOWER-MIB::breakerPercentLoad.1.5
trace: snmp_pdu_parse(): snmp_api.c, 4515:
dumph_recv: Value
trace: _sess_process_packet(): snmp_api.c, 5249:
sess_process_packet: received message id#0 reqid#706927312 len 555
trace: _sess_process_packet(): snmp_api.c, 5252:
sess_process_packet: parse fail
trace: snmp_sess_select_info2_flags(): snmp_api.c, 6051:
sess_select: for all sessions: 3 (to in 2279843.432377 sec)
verbose:sess_select: timer due in 0.765869 sec
verbose:sess_select: setting timer to 0.765869 sec, clear block (was 1)
trace: snmp_synch_input(): snmp_client.c, 183:
snmp_synch: Response (ReqID: 706927312 - Cmd 165)
Timeout: No Response from 172.18.0.10
https://github.com/rknall/wireshark/blob/aeab984510c521fbef9346aa604ea7e2b88dd41c/asn1/snmp/packet-snmp-template.c#L983 is where that's coming from, so this sounds like a bug upstream. Though I can also see how treating that as a 0 might be valid, I've not had to dig into this level of X509 detail on an SNMP issue before.
https://github.com/soniah/gosnmp/blob/60a0d22031649a67c07b71d8ed989b135a5f9715/helper.go#L533 is the relevant bit of upstream code. Someone would need to dig into the SNMP/X509 spec to see if this is invalid, and then address it upstream.
Am I correct in reading that you consider this only a device bug, and you're not concerned about the difference in operation between NetSNMP and snmp_exporter? I thought that would be the major concern -- mis-parsing invalid data and presenting it as correct.
Having casually browsed around SNMP BER encoding it definitely seems like zero-length encoding is not valid.
Anyway, this is the limit of how much time I will spend on the issue - I'll leave it to you authors of snmp_exporter to decide the next steps. I'm willing to provide pcap dumps if requested, but they contain potential sensitive data so it would have to be a request in that case.
The device here has a bug anyway.
It's also possible that our upstream has a bug, can you point me to the relevant docs?
The ones I looked at are:
- X.690 section 8.3.1 "Encoding of an integer value"
- Some discussion here
- Cross-reference that X.690 is listed as used for SNMP
- Sun BER encoding docs listing NULL as a zero-length possibility, but the type Integer is >0 length.
- Let's Encrypt on ASN.1 and BER and INTEGERS specifically
Thanks, that looks fairly clear. I've filed an issue upstream.
The upstream issue has been resolved. Once we bump the Go modules, we can cut a new release.