
Walking PDU table breaks some values on HPE PDU

Open bluecmd opened this issue 5 years ago • 12 comments

Host operating system: output of uname -a

Linux prometheus 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2 (2020-04-29) x86_64 GNU/Linux

snmp_exporter version: output of snmp_exporter -version

Make failed due to some dependencies, but I'm running at commit 1b3b0539ec70d692ec1d405345e685a6368c55d1

What device/snmpwalk OID are you using?

HPE PDU AF400A (1.00.A, firmware version 1.00.14). OID 1.3.6.1.4.1.232.165.2.3.2.1.4 / breakerPercentLoad.

Failing config (the most minimal I could make):

modules:
  hpe-pdumm:
    walk:
      - pdu

Good reference config (the most minimal I could make):

modules:
  hpe-pdumm:
    walk:
      - breakerPercentLoad

If this is a new device, please link to the MIB(s).

CPQPOWER-MIB.txt, e.g. from https://github.com/kamelnetworks/hp-pdu-prometheus

What did you do that produced an error?

$  curl -q 'localhost:9116/snmp?target=172.18.0.10&module=hpe-pdumm' | grep breakerPercentLoad

What did you expect to see?

# HELP breakerPercentLoad The breaker load in percent. - 1.3.6.1.4.1.232.165.2.3.2.1.4
# TYPE breakerPercentLoad gauge
breakerPercentLoad{breakerIndex="1",pduOutputIndex="1"} 6
breakerPercentLoad{breakerIndex="1",pduOutputIndex="2"} 6
breakerPercentLoad{breakerIndex="2",pduOutputIndex="1"} 0
breakerPercentLoad{breakerIndex="2",pduOutputIndex="2"} 5
breakerPercentLoad{breakerIndex="3",pduOutputIndex="1"} 8
breakerPercentLoad{breakerIndex="3",pduOutputIndex="2"} 8
[breakerIndex=>3 removed, they are just zero]

What did you see instead?

# HELP breakerPercentLoad The breaker load in percent. - 1.3.6.1.4.1.232.165.2.3.2.1.4
# TYPE breakerPercentLoad gauge
breakerPercentLoad{breakerIndex="1",pduOutputIndex="1"} 6
breakerPercentLoad{breakerIndex="1",pduOutputIndex="2"} 0
breakerPercentLoad{breakerIndex="2",pduOutputIndex="1"} 0
breakerPercentLoad{breakerIndex="2",pduOutputIndex="2"} 0
breakerPercentLoad{breakerIndex="3",pduOutputIndex="1"} 8
breakerPercentLoad{breakerIndex="3",pduOutputIndex="2"} 0
[breakerIndex=>3 removed, they are just zero]

The error is that every pduOutputIndex="2" row reports a load of 0.

Note that snmpwalk and snmpbulkwalk show no issues; the values are correct regardless of whether I walk breakerPercentLoad or pdu.

bluecmd avatar Jun 27 '20 17:06 bluecmd

My current workaround is to list them one-by-one: https://gist.github.com/bluecmd/f2c52dabaf5fbaac83a45291c2d6b7cc That seems to work for now until the issue is resolved.

bluecmd avatar Jun 27 '20 18:06 bluecmd

What does snmpbulkwalk with a Cr of 25 show? This is probably a device bug.

brian-brazil avatar Jun 27 '20 19:06 brian-brazil

Thanks, yes - quite clearly it is a device bug.

I have played around with various parameters and bulk walk seems to be quite shaky on this device.

$ snmpbulkwalk -Cr25 -v 2c -c [xx] 172.18.0.10 -m ALL  .1 # Correct
$ snmpbulkwalk -Cr25 -v 2c -c [xx] 172.18.0.10 -m ALL  pdu # Hangs and times out after CPQPOWER-MIB::breakerPercentLoad.1.4
$ snmpbulkwalk -Cr10 -v 2c -c [xx] 172.18.0.10 -m ALL  pdu # Hangs and times out at breakerStatus
$ snmpbulkwalk -Cr5 -v 2c -c [xx] 172.18.0.10 -m ALL  pdu # Hangs
$ snmpbulkwalk -Cr4 -v 2c -c [xx] 172.18.0.10 -m ALL  pdu # Correct
$ snmpbulkwalk -Cr25 -v 2c -c [xx] 172.18.0.10 -m ALL  breakerPercentLoad # Works

I was only looking at breakerPercentLoad before, so I missed the timeout in breakerStatus in the previous gist.

-Cr4 seems to be the highest repetition count at which I can walk pdu without a timeout. I set that in my config, and it seems to work as far as I can tell.
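For reference, the same limit can be set in the snmp.yml module itself rather than on the command line. A sketch, assuming a snmp_exporter version that supports the module-level max_repetitions option (the default is 25):

```yaml
modules:
  hpe-pdumm:
    walk:
      - pdu
    # Lowered from the default of 25; -Cr4 was the highest value this
    # device handled without timing out.
    max_repetitions: 4
```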

From the issue PoV, I think this can be closed.

Feature request

While debugging this I noticed that snmp_exporter does not really surface these timeouts when the device misbehaves - is that working as intended? The log is quite empty, and the only metric that might help is snmp_scrape_pdus_returned, but without knowing its golden value it is not really usable for detecting an issue.

The log with --log.level=debug shows only this:

level=debug ts=2020-06-27T19:56:14.793Z caller=main.go:99 module=hpe-pdumm target=172.18.0.10 msg="Starting scrape"
level=debug ts=2020-06-27T19:56:14.793Z caller=collector.go:164 module=hpe-pdumm target=172.18.0.10 msg="Walking subtree" oid=1.3.6.1.4.1.232.165.2
level=debug ts=2020-06-27T19:56:15.635Z caller=collector.go:177 module=hpe-pdumm target=172.18.0.10 msg="Walk of subtree completed" oid=1.3.6.1.4.1.232.165.2 duration_seconds=841.845929ms
level=debug ts=2020-06-27T19:56:15.637Z caller=main.go:110 module=hpe-pdumm target=172.18.0.10 msg="Finished scrape" duration_seconds=0.844159811

while snmpbulkwalk -Cr25 clearly says:

CPQPOWER-MIB::breakerCurrent.2.5 = INTEGER: 0
CPQPOWER-MIB::breakerCurrent.2.6 = INTEGER: 0
CPQPOWER-MIB::breakerPercentLoad.1.1 = INTEGER: 6
CPQPOWER-MIB::breakerPercentLoad.1.2 = INTEGER: 0
CPQPOWER-MIB::breakerPercentLoad.1.3 = INTEGER: 8
CPQPOWER-MIB::breakerPercentLoad.1.4 = INTEGER: 0
Timeout: No Response from 172.18.0.10

So I guess in summary:

  • Why does snmp_exporter treat this condition as a value of 0?
  • Why doesn't snmp_exporter expose some "something happened" log or metric?
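Lacking a built-in signal, one stopgap is to alert when the PDU count falls below a floor taken from a known-good scrape. A sketch of a Prometheus alerting rule; the threshold 120 is a hypothetical placeholder, and which labels are present depends on your relabeling config:

```yaml
groups:
  - name: snmp-pdu
    rules:
      - alert: SnmpScrapeReturnedFewPdus
        # 120 is a placeholder taken from a known-good scrape of this
        # module; adjust it to the device's normal PDU count.
        expr: snmp_scrape_pdus_returned < 120
        for: 15m
        annotations:
          summary: "SNMP walk of {{ $labels.instance }} returned fewer PDUs than expected"
```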

bluecmd avatar Jun 27 '20 19:06 bluecmd

While debugging this I noticed that snmp_exporter does not really surface these timeouts when the device misbehaves - is that working as intended?

That doesn't sound right. If there's a timeout that should result in a failed scrape. I'd suggest checking what the device is returning via tcpdump to see exactly where the issue lies.

brian-brazil avatar Jun 28 '20 01:06 brian-brazil

Comparing the pcaps from snmpbulkwalk and snmp_exporter, they look identical up until the device sends what Wireshark considers an invalid SNMP packet. The failing OID is CPQPOWER-MIB::breakerPercentLoad.1.5 (.1.3.6.1.4.1.232.165.2.3.2.1.4.1.5), which with max-repetitions 25 apparently falls on the boundary between requests.

Simple Network Management Protocol
    version: v2c (1)
    community: [xxx]
    data: get-response (2)
        get-response
            request-id: -1851290032
            error-status: noError (0)
            error-index: 24
            variable-bindings: 25 items
                1.3.6.1.4.1.232.165.2.3.2.1.4.1.5: 
                    Object Name: 1.3.6.1.4.1.232.165.2.3.2.1.4.1.5 (iso.3.6.1.4.1.232.165.2.3.2.1.4.1.5)
                    Integral value is zero-length
                        [Expert Info (Note/Undecoded): Integral value is zero-length]
                            [Integral value is zero-length]
                            [Severity level: Note]
                            [Group: Undecoded]

Comparing the behavior on this packet: snmp_exporter claims to be able to process it and does not error out, while snmpbulkwalk treats it as an error, retries the fetch (which returns the same broken response), and eventually reports a timeout.

Here is the relevant output (IMO) of snmpbulkwalk -DALL:

trace: snmp_parse_var_op(): snmp.c, 164:
dumph_recv:                         Name
dumpx_recv:                          06 10 2B 06 01 04 01 81 68 81 25 02 03 02 01 04
01 05
dumpv_recv:                            ObjID: CPQPOWER-MIB::breakerPercentLoad.1.5
trace: snmp_pdu_parse(): snmp_api.c, 4515:
dumph_recv:                         Value
trace: _sess_process_packet(): snmp_api.c, 5249:
sess_process_packet: received message id#0 reqid#706927312 len 555
trace: _sess_process_packet(): snmp_api.c, 5252:
sess_process_packet: parse fail
trace: snmp_sess_select_info2_flags(): snmp_api.c, 6051:
sess_select: for all sessions: 3 (to in 2279843.432377 sec)
verbose:sess_select: timer due in 0.765869 sec
verbose:sess_select: setting timer to 0.765869 sec, clear block (was 1)
trace: snmp_synch_input(): snmp_client.c, 183:
snmp_synch: Response (ReqID: 706927312 - Cmd 165)
Timeout: No Response from 172.18.0.10

bluecmd avatar Jun 28 '20 07:06 bluecmd

https://github.com/rknall/wireshark/blob/aeab984510c521fbef9346aa604ea7e2b88dd41c/asn1/snmp/packet-snmp-template.c#L983 is where that's coming from, so this sounds like a bug upstream. Though I can also see how treating that as a 0 might be valid; I've not had to dig into this level of ASN.1/BER detail on an SNMP issue before.

brian-brazil avatar Jun 28 '20 10:06 brian-brazil

https://github.com/soniah/gosnmp/blob/60a0d22031649a67c07b71d8ed989b135a5f9715/helper.go#L533 is the relevant bit of upstream code. Someone would need to dig into the SNMP/BER spec to see if this is invalid, and then address it upstream.

brian-brazil avatar Jun 29 '20 11:06 brian-brazil

Am I correct in reading that you consider this only a device bug, and you're not concerned about the difference in operation between NetSNMP and snmp_exporter? I thought that would be the major concern -- mis-parsing invalid data and presenting it as correct.

Having casually browsed the SNMP BER encoding rules, it definitely seems that a zero-length integer encoding is not valid.
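That matches X.690 §8.3.1, which requires an INTEGER's contents octets to be "one or more octets". A minimal Go sketch (a hypothetical strict parser, not the gosnmp code) showing how a conforming decoder would reject the `02 00` TLV this device sends, while accepting a valid `02 01 06`:

```go
package main

import (
	"errors"
	"fmt"
)

// decodeBERInteger decodes a single BER TLV expected to be an INTEGER
// (tag 0x02). Per X.690 §8.3.1 the contents octets shall consist of
// one or more octets, so a zero-length value is rejected rather than
// silently treated as 0.
func decodeBERInteger(tlv []byte) (int64, error) {
	if len(tlv) < 2 || tlv[0] != 0x02 {
		return 0, errors.New("not an INTEGER TLV")
	}
	length := int(tlv[1]) // short-form length only, for brevity
	if length == 0 {
		return 0, errors.New("invalid BER: INTEGER with zero-length contents")
	}
	if len(tlv) < 2+length {
		return 0, errors.New("truncated TLV")
	}
	// Sign-extend from the first content octet, then accumulate the rest.
	v := int64(int8(tlv[2]))
	for _, b := range tlv[3 : 2+length] {
		v = v<<8 | int64(b)
	}
	return v, nil
}

func main() {
	good := []byte{0x02, 0x01, 0x06} // INTEGER 6, as the PDU should send
	bad := []byte{0x02, 0x00}        // zero-length INTEGER, as the PDU sends

	if v, err := decodeBERInteger(good); err == nil {
		fmt.Println("good:", v)
	}
	if _, err := decodeBERInteger(bad); err != nil {
		fmt.Println("bad:", err)
	}
}
```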

Anyway, this is the limit of how much time I will spend on the issue - I'll leave it to you, the snmp_exporter authors, to decide the next steps. I'm willing to provide pcap dumps if requested, but they contain potentially sensitive data, so it would have to be a request in that case.

bluecmd avatar Jun 29 '20 11:06 bluecmd

The device here has a bug anyway.

It's also possible that our upstream has a bug; can you point me to the relevant docs?

brian-brazil avatar Jun 29 '20 11:06 brian-brazil

Thanks, that looks fairly clear. I've filed an issue upstream.

brian-brazil avatar Jun 29 '20 12:06 brian-brazil

The upstream issue has been resolved. Once we bump the Go modules, we can cut a new release.

SuperQ avatar Oct 25 '21 14:10 SuperQ