snmp_exporter

Pull specific interfaces with snmp.yml

Open freedomwarrior opened this issue 5 years ago • 61 comments

Hello. How can I pull only ethernetCsmacd interfaces from a device? For example, I have a network switch with around 1k l2vlan interfaces, and I don't need to pull them via snmp-exporter. How can I do that?

freedomwarrior avatar Jul 15 '19 14:07 freedomwarrior

I don't think there's a way to do that kind of filtering right now. But it's an interesting idea.

There have been some ideas on adding filtering based on OID index ranges, but those are also not implemented yet.

SuperQ avatar Jul 15 '19 14:07 SuperQ

You can specify specific oids though within a table.

brian-brazil avatar Jul 15 '19 16:07 brian-brazil

You can specify specific oids though within a table.

Can you show me some example?

freedomwarrior avatar Jul 16 '19 13:07 freedomwarrior
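
A minimal sketch of what listing specific OIDs within a table can look like, assuming the walk-entry form quoted later in this thread (a column OID followed by the instance index). The helper name `walk_entries` and the chosen indexes are hypothetical; the printed lines are meant to go under a module's `walk:` list in generator.yml.

```python
# ifHCInOctets column OID, as quoted later in this thread.
IF_HC_IN_OCTETS = "1.3.6.1.2.1.31.1.1.1.6"


def walk_entries(column_oid, column_name, indexes):
    """Return one generator.yml walk entry per instance index (hypothetical helper)."""
    return [
        f'- {column_oid}.{index}  # Instance of "{column_name}" with index "{index}"'
        for index in indexes
    ]


if __name__ == "__main__":
    # Indexes 40 and 41 are illustrative, not taken from a real device.
    for line in walk_entries(IF_HC_IN_OCTETS, "ifHCInOctets", [40, 41]):
        print(line)
```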

A good option would be an ifType filter: l2vlan for VLAN interfaces, ethernetCsmacd for Ethernet interfaces.

freedomwarrior avatar Jul 24 '19 06:07 freedomwarrior

A possible solution: RFC 1271, etherStatsDataSource.

freedomwarrior avatar Aug 23 '19 13:08 freedomwarrior

Hello again. Any update?

freedomwarrior avatar Oct 28 '19 08:10 freedomwarrior

@freedomwarrior PRs welcome.

SuperQ avatar Oct 28 '19 10:10 SuperQ

For network devices, it would be great if we could filter based on interface description. For some networks that could mean the difference between polling half a million interfaces or 20 thousand interfaces.

orgito avatar Oct 28 '19 10:10 orgito

I'm not sure this is something that we should have in the snmp exporter itself, as it means we'd need a two pass walk. I think figuring out what OIDs you want offline and putting them in the config would be better.

brian-brazil avatar Dec 02 '19 12:12 brian-brazil

Per @SuperQ, we may want to be able to notice a bunch of gets for adjacent oids and use a walk instead for efficiency.

brian-brazil avatar Dec 02 '19 12:12 brian-brazil
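
One way to picture the idea of spotting gets for adjacent OIDs: group the wanted instance indexes into contiguous runs, then treat long runs as candidates for a walk and the rest as individual gets. This is only a sketch of the grouping step; `coalesce`, `plan_requests`, and the `min_run` threshold are assumptions for illustration, not snmp_exporter behaviour.

```python
def coalesce(indexes):
    """Group integer indexes into inclusive (start, end) runs of adjacent values."""
    runs = []
    for index in sorted(set(indexes)):
        if runs and index == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], index)  # extend the current run
        else:
            runs.append((index, index))      # start a new run
    return runs


def plan_requests(indexes, min_run=3):
    """Split runs into ones worth walking and leftover indexes to get one by one."""
    walks, gets = [], []
    for start, end in coalesce(indexes):
        if end - start + 1 >= min_run:
            walks.append((start, end))
        else:
            gets.extend(range(start, end + 1))
    return walks, gets


if __name__ == "__main__":
    print(plan_requests([40, 41, 42, 45, 199]))  # ([(40, 42)], [45, 199])
```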

The current method, thanks to #283, can now reference a specific index in the walk section of the generator.yml:

- 1.3.6.1.2.1.31.1.1.1.6.40 # Instance of "ifHCInOctets" with index "40"

I understand that the generator is being kept as simple as possible and would avoid two pass collection on each run. Here are two suggestions for a solution:

First method: Extend what was done in #283 to permit some kind of syntax with a list or range: 1.3.6.1.2.1.31.1.1.1.6.[40,41,42,45,199] # 40,41,42,45,199 being instances. But I think the yml config file will not like this. This puts all the hard work outside generator.yml.
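
Since the list syntax above is only a proposal, the expansion would have to happen in a pre-processing step outside generator.yml. A rough sketch of that step, assuming a made-up notation in which `[40,41,45-47]` lists indexes and inclusive ranges; `expand_instances` is a hypothetical helper, not part of the generator.

```python
import re

# Matches "<base oid>.[<comma separated items>]"; the notation is hypothetical.
BRACKET = re.compile(r"^(?P<base>\d+(?:\.\d+)*)\.\[(?P<items>[^\]]+)\]$")


def expand_instances(spec):
    """Expand '1.3.6.[40,41,45-47]' into explicit instance OIDs."""
    match = BRACKET.match(spec.strip())
    if match is None:
        return [spec.strip()]  # plain OID: nothing to expand
    oids = []
    for item in match.group("items").split(","):
        item = item.strip()
        if "-" in item:
            low, high = (int(part) for part in item.split("-", 1))
            indexes = range(low, high + 1)  # inclusive range
        else:
            indexes = [int(item)]
        oids.extend(f"{match.group('base')}.{index}" for index in indexes)
    return oids


if __name__ == "__main__":
    for oid in expand_instances("1.3.6.1.2.1.31.1.1.1.6.[40,41,42,45,199]"):
        print(f"- {oid}")
```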

Second method, which would enhance all of snmp_exporter's use cases: Use a two pass system that will cache lookups for a period of time. This opens up all the features everyone needs when managing enterprise grade networks: filter prior to collection based on ifName, ifSpeed, ifAdminStatus, etc.

snmp_exporter is called to collect module A: pass 1 does lookups, applying filtering to create collection subsets and cache them; pass 2 collects. snmp_exporter is once again called to collect A: it only collects the identified subsets. After X minutes/hours, when a collection is requested, it runs a new lookup and updates the cache and collection subset.

This makes the snmp collector stateful for the collected module/target pairs, but will at best double performance as lookups are run less often and will wildly increase overall performance as only the required targets will be collected! The configuration will stay elegant and concise. There will be more regex and filtering rules, but that makes it much, much easier to maintain for users.

This is, to me, the most elegant solution, because otherwise snmp_exporter is only good for limited use cases. Currently I cannot see how anyone would be collecting hundreds or thousands of switches with varied collection profiles without maintaining hundreds of modules with a programmatic pre-generator, erck..

Also having a two pass framework would permit funky stuff like auto-detecting if interfaces should use HC or standard counters (less of a problem today with gig+ switches), downgrading collections to only if(HC)InOctets/if(HC)OutOctets for interfaces that are not actually physical interfaces (sub-interfaces, vlans, aggregates, virtual circuits, etc.) and plenty of other use-cases.

Cheers xkilian

xkilian avatar Feb 04 '20 05:02 xkilian
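
To make the two-pass proposal concrete, here is a toy model of it: pass 1 walks a lookup column (ifName here), applies a regex, and caches the matching indexes; pass 2 collects only those indexes until the cache expires. Everything in it is an assumption for illustration: the `IndexCache` class, the `lookup` and `collect` callables standing in for SNMP requests, and the injected clock. It is not how snmp_exporter works.

```python
import re
import time


class IndexCache:
    """Cache the filtered index set per (module, target) for ttl seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._entries = {}  # (module, target) -> (expires_at, indexes)

    def indexes(self, module, target, lookup, pattern):
        key = (module, target)
        now = self.clock()
        entry = self._entries.get(key)
        if entry is None or now >= entry[0]:
            # Pass 1: full lookup walk, then filter on the label value.
            regex = re.compile(pattern)
            selected = sorted(i for i, name in lookup(target).items() if regex.search(name))
            entry = (now + self.ttl, selected)
            self._entries[key] = entry
        return entry[1]


def scrape(cache, module, target, lookup, collect, pattern):
    """Pass 2: collect values only for the cached index subset."""
    return {i: collect(target, i) for i in cache.indexes(module, target, lookup, pattern)}


if __name__ == "__main__":
    # Sample ifIndex -> ifName data standing in for a real lookup walk.
    names = {1: "Gi0/1 Monitored", 2: "Vlan100", 3: "Gi0/2 Monitored"}
    cache = IndexCache(ttl=3600)
    print(scrape(cache, "A", "switch1", lambda t: names, lambda t, i: i * 1000, "Monitored"))
```

Between refreshes, an interface that starts matching the regex is not picked up, which is the staleness trade-off debated in the replies.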

Use a two pass system that will cache lookups for a period of time.

This would not be correct, as interfaces can be added dynamically.

will at best double performance as lookups are run less often

This would also not be correct, as lookup values can change over time.

All this complexity is not something I plan on adding to the SNMP exporter. I'd ask instead why some vendors' SNMP implementations can't produce data for everything in anything resembling a sane amount of time. From a Prometheus standpoint, you can do filtering like this once the data gets into Prometheus.

brian-brazil avatar Feb 04 '20 08:02 brian-brazil

This would not be correct, as interfaces can be added dynamically.

Lookups would be done on the full set of interfaces, so any interfaces that are dynamically added or removed would use the current known lookup values and would be updated on the next interval. The typical use case would be that the interfaces are dynamically added at the next lookup refresh, as they most likely match the regex based on ifName, ifType, ifDescr, etc.

This would also not be correct, as lookup values can change over time.

Exactly right, but typically an SNMP monitoring system for switches and routers will not be very dynamic and can tolerate some delay of information. If a piece of equipment is dynamic and critical, then it should use the current method with its drawbacks. But the critical devices that a network admin is responsible for are typically core routers, distribution routers, and perimeter routers; these do not change often, and having a couple of hours' delay between updates of the ifName or ifType, etc. is a non-issue. I have used (and still use) a system that did this very successfully with great scalability, kudos to @titilambert; lookups were only done on monitoring system restarts, which were done every few days/weeks, and it still met the networking team's needs.

Network devices and other industrial hardware using SNMP have always had poor implementations, but even a good implementation can get overwhelmed, as SNMP trees have way too much information available for regular unfiltered polling. Some new devices now offer HTTP APIs for metrics (IOS-XE and Linux based NOSes) and some are considering streaming telemetry, but that is still a long way off.

This type of minimal complexity corrects the current way of running lookups at every run, which is a costly operation that greatly limits the exporter's usability. I think this can fit in the Prometheus philosophy of not being 100% correct, but being more automated and flexible for the user. SNMP is a friend and a bane. ;-)

Cheers,

xkilian avatar Feb 04 '20 15:02 xkilian

If you don't care about fresh data, then figuring out which interfaces to poll out of band seems the better solution to me.

HTTP and streaming are distractions; with low network latency (as we don't have windowing), SNMP can transfer a lot of data in not much time.

brian-brazil avatar Feb 04 '20 15:02 brian-brazil

Polling data is always fresh. It is only the metadata/labels that are less fresh. This also means that dynamically adding/removing interfaces based on regexes is only applied when the lookup is done. Users can adjust to their reality (60 minutes, 24 hours, etc.) for the refresh time of the lookups/indexes. Even users that need to monitor all ports of a switch can simply match against something that will collect all expected ports; this way all ports would be monitored, but only the metadata would lag a bit.

- 1.3.6.1.4.255.1.2 # Some table OID
  match ifName # Match some index or lookup label
   regex  Monitored # Collect interfaces that contain the name Monitored

I would rather have a regex rule that can apply to any number of use cases, instead of having to develop a management system to connect to the switches (regularly), figure out what to collect, build the generator.yml, rerun the generator, and restart snmp-exporter.

There is work to be done, nothing is free, but I think it would be a win win for everyone in the enterprise space. Please seriously think about this from an acceptability point of view.

xkilian avatar Feb 04 '20 19:02 xkilian

I would like to allocate resources to implement this feature. As this will involve material investment, I would like to validate that if such a change is done, it can be integrated. Obviously coding standards would be respected. Documentation and a specification for the work to be done would be provided.

xkilian avatar Mar 19 '20 15:03 xkilian

I don't believe this currently belongs in the snmp exporter, but if you write a tool to auto-generate the config in this fashion we can link it from the readme.

brian-brazil avatar Mar 19 '20 15:03 brian-brazil

  1. It cannot be done in the current generator code base, as currently we have to maintain a dozen different generators with their own associated MIBs.
  2. Even if an outside tool or the existing generator would auto-generate the snmp.yml data, the file would be huge as it would have to explicitly detail ALL the specific instances to collect for each and every device (think thousands of devices for a typical big network). This would also mean re-implementing all the logic of the generator to create the snmp.yml in the first place, which is a non-trivial and useless task in this context.
  3. I have trouble understanding why you do not want to improve the collector?
  • In my humble opinion, it does belong in the snmp-exporter or a sister process, as it has access to all the data it needs (authentication, filtering regexes, index data). Keeping a cache of index items is not something that will be costly in processing. Updating the indexes will require some local processing, but will fix the current breaking problem that the collector does not work with network devices without serious impact to collection time and resource usage on the destination device.

  • I am open to suggestions. For example, a separate process that is responsible for collecting indexes, applying the filters, and updating the local cache with the new information could be an option. The issue becomes how the two processes communicate. It could be a local redis or inter-process communication. What I want to avoid is having to modify the snmp.yml in any way.

  • The current implementation is a clear no-go for anyone with an enterprise network to monitor. Which is why the uptake of this module has been anemic and restricted to home, small office use or devices that have very little data in their tables.

xkilian avatar Mar 19 '20 16:03 xkilian

the file would be huge as it would have to explicitly detail ALL the specific instances to collect for each and every device

I don't see how that's a problem, disk is cheap.

This would also mean re-implementing all the logic of the generator to create the snmp.yml in the first place which is a non-trivial and useless task in this context.

There's no need to re-implement the generator, you can run it.

I have trouble understanding why you do not want to improve the collector?

As indicated above, a cache could produce incorrect results and this would require adding business logic into the system which is not something we presently have.

The current implementation is a clear no-go for anyone with an enterprise network to monitor. Which is why the uptake of this module has been anemic and restricted to home, small office use or devices that have very little data in their tables.

That's an exaggeration, and there are many existing enterprise users. The problem here is devices which are not fit for purpose in terms of their SNMP performance; it's not to do with data volumes per se.

If you want to dynamically adjust what the exporter scrapes, you'll need to dynamically generate the snmp.yml.

brian-brazil avatar Mar 19 '20 17:03 brian-brazil
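
As a sketch of the out-of-band approach suggested here, the script below takes an ifType table that was walked separately (hard-coded sample data in this example), keeps the ethernetCsmacd rows, and prints walk entries for one counter column. The numeric ifType values come from the IANAifType-MIB (ethernetCsmacd is 6, l2vlan is 135); the function names and the sample table are hypothetical, and the resulting entries would still need to be run through the generator.

```python
ETHERNET_CSMACD = 6  # IANAifType-MIB value for ethernetCsmacd
L2VLAN = 135         # IANAifType-MIB value for l2vlan

IF_HC_IN_OCTETS = "1.3.6.1.2.1.31.1.1.1.6"


def select_by_iftype(if_types, wanted):
    """Return the sorted ifIndex values whose ifType is in `wanted`."""
    return sorted(index for index, if_type in if_types.items() if if_type in wanted)


def render_walk(column_oid, indexes):
    """Render generator.yml walk entries for the chosen instances."""
    return "\n".join(f"- {column_oid}.{index}" for index in indexes)


if __name__ == "__main__":
    # Sample ifIndex -> ifType mapping; a real run would fill this from an SNMP walk of ifType.
    sample = {1: ETHERNET_CSMACD, 2: ETHERNET_CSMACD, 1001: L2VLAN, 1002: L2VLAN}
    print(render_walk(IF_HC_IN_OCTETS, select_by_iftype(sample, {ETHERNET_CSMACD})))
```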

The main take away here is that you do not want to maintain additional general code in the snmp collection engine. At the cost of pushing major complexity out beyond the generator.

I firmly believe that snmp-exporter can be improved without sacrificing maintainability, and without putting any business logic in the snmp-exporter code.

I agree that SNMP implementations are variable in their quality, but that is the nature of the products. Wishing it was different will not make their collection any more possible. I do not believe it is an exaggeration: collecting SNMP data every minute on 500+ varied enterprise grade network devices (routers and switches) would be a clear exception.

I would love other networking people to pipe up and provide feedback on their experience and what their pain points are with snmp-exporter. This would be a great BoF for a monitorama or PromCon!

Please do not misunderstand what I am saying concerning a cache. The cache is the selection of which indexes to collect. Not the actual data being collected. So the collected data is always accurate. The difference is that a configuration change on an end device may not immediately result in an index being removed from or added to the list of things collected. I would certainly not want a system that collects inaccurate data.

Wish you all the best in these uncertain times.

xkilian avatar Mar 25 '20 01:03 xkilian

I agree with @xkilian. The current walk code needs to be more flexible for which ranges to walk. There are many devices that expose large indexes of virtual devices that need to be skipped. Not just for the Prometheus-side, but also to reduce the device side load.

SuperQ avatar Mar 25 '20 07:03 SuperQ

The main take away here is that you do not want to maintain additional general code in the snmp collection engine.

That's part of it, but primarily from an architecture standpoint I don't think this belongs within the snmp exporter/generator at all.

At the cost of pushing major complexity out beyond the generator.

I think you're overestimating how hard this is to do on your end.

without putting any business logic in the snmp-exporter code.

That's not possible, choosing which oids to hit is business logic as it depends on your deployment rather than on the device itself.

The cache is the selection of which indexes to collect. Not the actual data being collected. ... I would certainly not want a system that collects inaccurate data.

If something is missing that should be present, that's inaccurate data. If something is present that should be missing, that's inaccurate data. If something is correctly present but has incorrect labels, that's inaccurate data.

Metrics aren't just about the numbers, existence and labels also matter. Caching is not safe.

The current walk code needs to be more flexible for which ranges to walk.

Users can already specify to only walk specific oids within a table. How they determine those oids is up to them.

brian-brazil avatar Mar 25 '20 07:03 brian-brazil
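
The three kinds of inaccuracy listed in this reply can be made concrete by comparing a cached index-to-label map against what the device reports now. A small sketch, with the `staleness` helper and the sample data being purely illustrative:

```python
def staleness(cached, current):
    """Compare cached index->label data with the device's current state."""
    return {
        # Present on the device now but not in the cache: would not be scraped.
        "missing": sorted(set(current) - set(cached)),
        # Still in the cache but gone from the device: would be requested in vain.
        "extra": sorted(set(cached) - set(current)),
        # Scraped, but exported with an outdated label.
        "relabelled": sorted(
            i for i in set(cached) & set(current) if cached[i] != current[i]
        ),
    }


if __name__ == "__main__":
    cached = {1: "uplink", 2: "server-a", 3: "spare"}
    current = {1: "uplink", 2: "server-b", 4: "server-c"}
    print(staleness(cached, current))  # {'missing': [4], 'extra': [3], 'relabelled': [2]}
```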

Users can already specify to only walk specific oids within a table. How they determine those oids is up to them.

The current method for this is not good. It switches from walk to get, meaning you drop the ability to bulk fetch data. Having range control over the walk/bulkwalk is something we should support.

SuperQ avatar Mar 25 '20 08:03 SuperQ

It's a bulkget, so it's as efficient as a bulkwalk.

brian-brazil avatar Mar 25 '20 08:03 brian-brazil

Hi @SuperQ , @xkilian ,

We're working on something similar to your case, I believe. Our main idea is to create our own "generator" projects. Our solution relies only on adding "some constant label" to a metric. https://github.com/prometheus/snmp_exporter/pull/497

This way we're aiming to "get" only a subset of interfaces and thereby decrease the load on the Network Element side. We might have tens of thousands of interfaces on some specific routers BUT are only interested in hundreds of those. And we definitely don't want to WALK the whole SNMP table for this.

Is your case similar? How did you solve this issue at your side?

thanks..

karanlik avatar Apr 06 '20 10:04 karanlik

That's something separate to this issue, let's not confuse things.

brian-brazil avatar Apr 06 '20 11:04 brian-brazil

@karanlik Solving this in a generator is a false good idea without greatly changing the way it interacts with snmp_exporter, IMO. I believe I have described why it is not a good idea at scale.

This way we're aiming to "get" only a subset of interfaces and thereby decrease the load on the Network Element side. We might have tens of thousands of interfaces on some specific routers BUT are only interested in hundreds of those. And we definitely don't want to WALK the whole SNMP table for this.

This is one of the core issues we aim to solve to achieve at scale snmp monitoring.

The correct solution is somewhere in the middle, where the generator.yml/generator plays its role, but the key magic happens in the snmp_exporter. Brian and I agree to disagree on some issues, but hopefully we can eventually get things all squared up. To that end, we will be forking snmp_exporter to provide a solution that will permit post-indexing, but pre-collection, filtering of what to collect. It should also permit configurable indexing recurrence, so as not to WALK the indexes at each collection run and, more importantly, NOT fully walk every configured branch. The default would be the current way of doing things. This way we can have a debate with actual metrics. I have confidence in the how and why it can be done a certain way to achieve more scalability/flexibility, while retaining the current configuration elegance and staying easy to maintain and troubleshoot from a code perspective. We have a scalability requirement that is not being met, so we intend to contribute a solution and in the process propose an improvement to an amazing tool!

xkilian avatar Apr 21 '20 04:04 xkilian

@SuperQ It is not about how to get the data; as Brian mentioned, the tool is based on netsnmp (which is super efficient) and uses the correct methods of getting the data. It is an issue of asking for too much data from the end devices. ;-)

xkilian avatar Apr 21 '20 04:04 xkilian

Thanks @xkilian. Yes, that's what I was trying to say, but not saying it correctly. Reducing the size of the device fetches for specific cases would go a long way to improve things for some users of the snmp_exporter. Looking forward to your PRs.

SuperQ avatar Apr 21 '20 07:04 SuperQ

I think you misread: @xkilian appears to intend to fork, as I've indicated I'm not willing to accept into the snmp exporter a solution that a) adds business logic and b) produces inaccurate results. I personally believe this is best handled by 3rd party tools that produce a new generator.yml periodically (thus clearly isolating the business logic and correctness issues), rather than trying to wedge this into existing binaries.

brian-brazil avatar Apr 21 '20 08:04 brian-brazil