snmp_exporter

Request: Ability to disable bulkwalk on either the module or at scrape time

Open BenB196 opened this issue 6 years ago • 27 comments

Host operating system:

N/A

snmp_exporter version:

Any

What device/snmpwalk OID are you using?

Cyberpower PDU

If this is a new device, please link to the MIB(s).

MIB(s)

What did you do that produced an error?

Try to monitor this device with SNMP exporter. SNMP exporter uses only bulkwalk. It fails because, for some reason, this device's SNMP implementation doesn't support bulkwalk.

What did you expect to see?

It would be nice to have a way to tell SNMP exporter to perform a regular SNMP walk instead of a bulkwalk, to support devices which don't properly implement SNMP.

It would be nice to either specify at the module level to use walk instead of bulkwalk, or at scrape time pass something that would tell it to use a regular SNMP walk.

Note: While I understand that this is not really an issue with the SNMP exporter and instead with the device, it would make monitoring things with poor SNMP implementations easier.

BenB196 avatar Dec 12 '19 21:12 BenB196

You can already configure a module to use v1.

brian-brazil avatar Dec 12 '19 22:12 brian-brazil

@brian-brazil If I understand you correctly, you are saying to switch to SNMPv1, correct?

If this is correct, it is slightly different than what I am requesting. The device itself """supports""" SNMPv3, it just doesn't support bulkwalk/bulkget, which is why I am requesting the ability to not use bulkwalk/bulkget with SNMPv3.

If this is incorrect, can you point me to the documentation that would better explain what you are talking about?

BenB196 avatar Dec 13 '19 13:12 BenB196

I mean to use v1. If your device doesn't support bulkget/bulkwalk, then it is incorrect to state that it implements SNMPv2/3.

brian-brazil avatar Dec 13 '19 14:12 brian-brazil

While I understand that this issue revolves around an incorrect implementation of SNMPv3 on the device itself, I don't think that being forced to use SNMPv1 is the correct solution here, as it reduces security. And in a zero trust network, not using SNMPv3 is extremely risky.

I still think that it would be nice to offer a way to use snmpwalk/snmpget via SNMPv2/v3.

BenB196 avatar Dec 13 '19 16:12 BenB196

I'm not sure we should be adding support for a single device which clearly violates the RFCs, particularly when using SNMPv1 read-only, restricted to source IP, and (I presume) only on a management network is an option.

brian-brazil avatar Dec 13 '19 16:12 brian-brazil

SNMPv3 security isn't really worth anything anymore. The best auth security is SHA, which is considered broken by most people. You should very much NOT be exposing SNMP in a ZeroTrust way over the internet. SNMPv3 isn't anywhere near secure enough to be exposed publicly, and isn't part of a ZeroTrust design. That's not what ZeroTrust means.

That said, I spent a few minutes scanning the RFCs, trying to find where the SNMP specs say that bulkwalk is a required feature. But it seems like it's not technically required. So maybe we should consider a flag to disable bulk walks in v2c/v3. :-/

SuperQ avatar Dec 13 '19 16:12 SuperQ

One other sad thing about the CyberPower devices (I have one), is that they only support authNoPriv, which means that data will not be encrypted over the wire even with SNMPv3. :joy:

SuperQ avatar Dec 13 '19 17:12 SuperQ

We did actually find where the RFCs say that BulkGet is mandatory for v3: RFC 1905 section 4.2 combined with RFC 2570 section 6.3.

brian-brazil avatar Dec 13 '19 17:12 brian-brazil

So we found the relevant section of the RFCs. RFC 1905 does state:

   It is mandatory that all SNMPv2 entities acting in a manager role be
   able to generate the following PDU types: GetRequest-PDU,
   GetNextRequest-PDU, GetBulkRequest-PDU, SetRequest-PDU,
   InformRequest-PDU, and Response-PDU; further, all such
   implementations must be able to receive the following PDU types:
   Response-PDU, SNMPv2-Trap-PDU,

But, I'm still thinking we should have a "don't use bulk" flag in the config to work around broken devices. There's just too much crap SNMP out there for us to be RFC strict.
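To make the trade-off behind such a flag concrete, here is a minimal sketch in Python. It is not snmp_exporter code: `FakeAgent`, `walk`, and the `use_bulk` option are hypothetical names invented for illustration, and the "agent" is just an in-memory sorted list of OIDs. It only shows that a GetNext walk and a GetBulk walk return the same data, and how the request counts differ.

```python
import bisect


class FakeAgent:
    """In-memory stand-in for an SNMP agent (hypothetical, illustration only)."""

    def __init__(self, oids, supports_bulk=True):
        # OIDs are integer tuples, so sorting gives lexicographic MIB order.
        self.oids = sorted(oids)
        self.supports_bulk = supports_bulk
        self.requests = 0

    def get_next(self, oid):
        """GetNext: the first OID strictly after `oid`, or None at end of MIB."""
        self.requests += 1
        i = bisect.bisect_right(self.oids, oid)
        return self.oids[i] if i < len(self.oids) else None

    def get_bulk(self, oid, max_repetitions):
        """GetBulk: up to `max_repetitions` OIDs strictly after `oid`."""
        self.requests += 1
        if not self.supports_bulk:
            # Models a device that never answers GetBulk requests.
            raise TimeoutError("agent ignored GetBulk")
        i = bisect.bisect_right(self.oids, oid)
        return self.oids[i:i + max_repetitions]


def walk(agent, root, use_bulk=True, max_repetitions=25):
    """Collect every OID under `root`; `use_bulk` is the hypothetical opt-out."""
    found, cursor = [], root
    while True:
        if use_bulk:
            batch = agent.get_bulk(cursor, max_repetitions)
        else:
            nxt = agent.get_next(cursor)
            batch = [] if nxt is None else [nxt]
        if not batch:
            return found  # end of MIB
        for oid in batch:
            if oid[:len(root)] != root:
                return found  # walked past the requested subtree
            found.append(oid)
        cursor = batch[-1]


root = (1, 3, 6, 1, 2, 1, 2, 2, 1)
table = [root + (col, row) for col in range(1, 4) for row in range(1, 11)]
outside = [(1, 3, 6, 1, 2, 1, 4, 1, 0)]

good = FakeAgent(table + outside)
bulk_result = walk(good, root, use_bulk=True, max_repetitions=10)
bulk_requests = good.requests

broken = FakeAgent(table + outside, supports_bulk=False)
plain_result = walk(broken, root, use_bulk=False)
plain_requests = broken.requests

print(bulk_requests, plain_requests)  # 4 31
```

In this toy setup the 30-row walk takes 4 requests with bulk and 31 without, and the agent that ignores GetBulk can still be walked once `use_bulk` is turned off.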

SuperQ avatar Dec 13 '19 17:12 SuperQ

While I don't expose the PDU over the internet, there are about 150 of these PDUs that I need to monitor at remote sites over vpn, and network segmentation at all of these sites is not the best, which is why I consider it a "ZeroTrust" environment.

BenB196 avatar Dec 13 '19 17:12 BenB196

That's not what "Zero Trust" means.

SuperQ avatar Dec 13 '19 17:12 SuperQ

You are correct, I was using the term loosely to describe a network which I cannot trust to be 100% secure, which I guess is not an accurate usage of the term.

BenB196 avatar Dec 13 '19 17:12 BenB196

"untrusted network" is what you probably meant then.

brian-brazil avatar Dec 13 '19 17:12 brian-brazil

> But, I'm still thinking we should have a "don't use bulk" flag in the config to work around broken devices. There's just too much crap SNMP out there for us to be RFC strict.

I'd like to vote for this feature request. We have several pieces of equipment with weak processors, and we rely on 64-bit counters. The processors are so weak that the CLI hangs or becomes extremely sluggish while the device is being scraped. For reasons beyond my understanding, I can run two concurrent snmpwalk processes (no bulk) and dump the nearly 400k OIDs under 1.3.6.1.4.1 without noticing anything in the CLI. That led me to investigate some time ago, until I figured out that snmp-exporter was using bulkwalk and I was not.

I've considered hacking together a small Python script that reads the generated .yaml config but walks differently, but I was hoping to see snmp-exporter evolve in this matter. [edited] (It'd take ages for me to do this in Go)

I pretty much share @SuperQ's quoted opinion on this: there's just too much of this stuff around, and being RFC strict is not helping.

ntavares avatar Feb 28 '20 13:02 ntavares

I'm not sure what your argument has to do with this issue or being RFC strict. If you want to use only getnext then specify v1.

brian-brazil avatar Feb 28 '20 14:02 brian-brazil

@brian-brazil for a second I thought I'd posted my comment on the wrong issue... isn't this about

> to have a way to tell SNMP exporter to perform a regular SNMP walk instead of a bulkwalk?

In regards to being RFC strict, I was just picking up on @SuperQ's quote. We can argue that (quoting you)

> If your device doesn't support bulkget/bulkwalk, then it is incorrect to state that it implements SNMPv2/3.

but, for the matter at hand, that is irrelevant. The fact is that for one reason or another (like mine) @BenB196 cannot use the exporter. I can, but at a cost in usability: in our case, it is disabled where it hurts most.

In short, I also

> don't think that being forced to use SNMPv1 is the correct solution

I suppose if we get a clear "not going to happen" here, we'll have to think of something else, otherwise I kind of keep hoping that someone with Go skills hacks an alternative "walk" (instead of that "bulkwalk").

ntavares avatar Feb 28 '20 15:02 ntavares

> isn't this about

This issue was opened about a device which supports auth mechanisms for v3, but doesn't support v3.

Your situation is about performance, and you haven't mentioned anything that would indicate that v1 would be a problem for you. The difference between v1 and v2 is bulkwalk.

brian-brazil avatar Feb 28 '20 15:02 brian-brazil

Quoting the original request:

> Request: Ability to disable bulkwalk on either the module or at scrape time [...] What did you expect to see? [...] a way to tell SNMP exporter to perform a regular SNMP walk instead of a bulkwalk [...] specify at the module level to use walk instead of bulkwalk, or [...] use regular SNMP walk.

I'm pretty sure this is what this issue is about. The OP reiterates later, in regards to the mention about v1:

> [...] saying switch to SNMPv1 [...] is slightly different than what I am requesting

I think it can't get more clear than this.

As one of the replies was:

> I'm not sure we should be adding support for a single device which clearly violates the RFCs

I wanted to contribute another example of why GetNext is still a desirable feature when using v2/v3 (in my case it is indeed performance-related; in the OP's case it is the lack of proper bulk support on the device).

That last quote is what I was referring to with "being (rfc) strict", btw.

> and you haven't mentioned anything that would indicate that v1 would be a problem for you

If I wanted/could use v1, then this request would not be relevant to me - although I appreciate the suggestion.

> The difference between v1 and v2 is bulkwalk.

If I understand correctly, the GetNext code is actually there already. Is there any technical reason not to support it for v2 and v3?

ntavares avatar Feb 29 '20 13:02 ntavares

> If I wanted/could use v1, then this request would not be relevant to me - although I appreciate the suggestion.

Could you explain why you don't want to/can't use v1? I'm not a fan of adding code to deal with poorly designed devices when there's an easy workaround.

brian-brazil avatar Feb 29 '20 14:02 brian-brazil

> I'm not a fan of adding code to deal with poorly designed devices when there's an easy workaround.

Is this the answer to my "why not" question?

Assuming so, as Ben already mentioned, we're all presented with too many of these "poorly designed devices"; most of the time, "being (rfc) strict" in these cases turns out to be more of a dogmatic/academic barrier preventing users from being *cough* happy (read: to accomplish anything at all)... look, e.g., at the time we're spending discussing such a simple thing (enabling already existing code on a different codepath). I'm not trying to underestimate the amount of work that has to be done to make this happen; however, this is my only way of pushing this forward at the moment, as I have zero Go skills.

We want to use the v3 security mechanisms. And IIRC we'd be advised to use > v1 for 64-bit counters, although I suspect that there could be some non "RFC-strict" hacks on both sides for that one.

ntavares avatar Feb 29 '20 14:02 ntavares

> (read: to accomplish anything at all)

I'm not stopping you from accomplishing anything at all. Users are not entitled to a feature merely because they're not willing to try the obvious workaround.

> And IIRC we'd be advised to use > v1 for 64-bit counters, although I suspect that there could be some non "RFC-strict" hacks on both sides for that one.

I'd expect that to work either way. Did you test it?

brian-brazil avatar Feb 29 '20 15:02 brian-brazil

Also it occurs to me, did you try tuning max_repetitions?

Attempting to walk 400k OIDs with bulk walk is going to be difficult; without bulk walk it is likely to be impossible, as you'd need no more than a 0.3ms average response time for a 2m scrape.
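The 0.3ms figure is a simple time budget, and the sketch below works it out. It assumes requests are issued strictly one after another and that each GetNext returns one OID; the value of 25 OIDs per bulk request is an assumed max_repetitions setting used only for comparison, and `per_request_budget_ms` is a helper name made up for this example.

```python
def per_request_budget_ms(scrape_seconds, oid_count, oids_per_request=1):
    """Average time each request may take for the walk to fit the scrape window."""
    requests = -(-oid_count // oids_per_request)  # ceiling division
    return scrape_seconds * 1000 / requests


# Plain GetNext walk: one OID per round trip.
getnext_budget = per_request_budget_ms(120, 400_000)

# Bulk walk, assuming 25 OIDs come back per request.
bulk_budget = per_request_budget_ms(120, 400_000, oids_per_request=25)

print(f"GetNext: {getnext_budget:.1f} ms per request")  # GetNext: 0.3 ms per request
print(f"GetBulk: {bulk_budget:.1f} ms per request")     # GetBulk: 7.5 ms per request
```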

brian-brazil avatar Feb 29 '20 15:02 brian-brazil

> did you try tuning max_repetitions?

Yes, we tried with non-default max-repetitions (several values, including 1). It did improve a bit, but the sluggish CLI is noticeably still there. It just doesn't solve the problem the way GetNext does.

Note: we're not walking 400k OIDs, I just used that as a (tested) example that we could do that on these devices without noticing impact in the CLI. It did take a while, and I did use walk (snmpwalk), as otherwise the device would be brought to its knees that whole time.

ntavares avatar Feb 29 '20 20:02 ntavares

If you'd like a change to be considered, please present your full use case. That includes scrape frequency, oids touched, the device model, and why it's not possible for you to use v1.

brian-brazil avatar Feb 29 '20 20:02 brian-brazil

++ for that change request

Currently I'm facing an issue with a device which seems to have a poor bulk implementation (an assumption on my part). v1/v2 is not possible for security reasons, so it is not a usable workaround. At the very least, being able to restrict the exporter to a plain snmpwalk would be a big help in confirming the culprit of the error.

Andy1616 avatar Oct 13 '21 15:10 Andy1616

Is it possible to have two modules, one for v1 and the other for v2 or v3? I thought only one was possible. Does going to v1 mean all scraping has to be v1? Can you help with an example? As of now I have a Brother 5450 and an HL5100 that work just fine with v2, but for some reason the Brother HL6200 only works with snmpwalk and not with bulk, and therefore does not work with snmp_exporter over v2.

safadig avatar Jan 10 '22 03:01 safadig

> If you'd like a change to be considered, please present your full use case. That includes scrape frequency, oids touched, the device model, and why it's not possible for you to use v1.

v1 is not possible to use with any network equipment that has ports faster than 1 Gbit/s. The ifHC* counters can be read only with v2c or v3, but many devices have buggy bulk implementations that lead to high CPU usage.
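The reason is that the ifHC* objects are Counter64, a type SNMPv1 cannot carry, so a v1 poller is left with the 32-bit octet counters. A rough sketch of how quickly those wrap at sustained line rate follows; `wrap_seconds` is a hypothetical helper, and the figures are a worst case that assumes the link is fully saturated.

```python
def wrap_seconds(counter_bits, link_bits_per_second):
    """Seconds for an octet counter to wrap at sustained line rate."""
    octets_per_second = link_bits_per_second / 8
    return 2 ** counter_bits / octets_per_second


for label, rate in [("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)]:
    print(f"{label}: 32-bit octet counter wraps in {wrap_seconds(32, rate):.1f} s")

years_64bit = wrap_seconds(64, 10e9) / (365 * 24 * 3600)
print(f"10 Gbit/s: 64-bit octet counter wraps in about {years_64bit:.0f} years")
```

At 1 Gbit/s a 32-bit counter can wrap in roughly 34 seconds and at 10 Gbit/s in under 4 seconds, so with a scrape interval longer than that, wraps can go undetected.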

mirackle-spb avatar Jul 26 '22 12:07 mirackle-spb