
support index lookup caching

Open jelmd opened this issue 3 years ago • 10 comments

One of the strengths of the Prometheus exposition format is that it delivers human-readable metrics out of the box. To accomplish this with snmp_exporter as well, the exporter usually needs to read slow lookup tables (~100 ± 50 ms per table), which makes a poll rate of 0.3..1 Hz or higher impossible and imposes a more or less high load on the related device (and on snmp_exporter as well). Very inefficient.

It would be much better if snmp_exporter cached lookup tables. Because such tables are stable at least until the next boot of the device (often even dynamically enumerated sensors keep their index numbers across reboots), one may cache them. E.g. the lookup: descrTable could be modified to lookup: checkVal@descrTable, where checkVal is the SNMP variable used to check whether the cached descrTable is still valid - it could be e.g. sysUptime. If the lookup value has no .*@ prefix, snmp_exporter should work as usual, i.e. not cache the related table.
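For illustration, a minimal sketch of how the proposed notation could look in a generator module (the sysUpTime@ifName form follows the checkVal@descrTable scheme described above; this syntax does not exist today):

modules:
  if_mib:
    walk:
    - ifHCInOctets
    lookups:
    # proposed: cache the ifName table and only re-walk it when
    # sysUpTime indicates the device has rebooted
    - source_indexes: [ifIndex]
      lookup: sysUpTime@ifName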

jelmd avatar Apr 13 '22 05:04 jelmd

Hello, this is a largely solved problem with the pull request branch https://github.com/prometheus/snmp_exporter/pull/624. I have been running it for 1.5 years with many types of networking and non-networking gear in multiple large enterprise environments, with success. In one instance a single poller is getting 40K metrics from 900 networking devices (plus thousands more metrics from 500 non-networking devices) every minute at 5% CPU load on a 4-core VM, with no load or problems seen on the polled devices. This is possible with index filtering combined with index lookup: every run (every minute in my case) issues a single lookup per index, and all other metrics using the same index are filtered so that only the data we are interested in is collected (e.g. only collect uplinks, or only collect interfaces named X/Y/Z); see the sketch below. The snmp_exporter project team has reviewed the pull request, but finalizing the review seems to be jammed.
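For orientation, a dynamic filter in the generator config of that branch looks roughly like this (the OIDs and the match value here are illustrative only, and the exact keys may differ from the PR branch):

modules:
  if_mib:
    walk:
    - ifHCInOctets
    filters:
      dynamic:
      # walk ifAlias first, then restrict ifHCInOctets to the
      # indexes whose alias matches one of the given values
      - oid: 1.3.6.1.2.1.31.1.1.1.18 # ifAlias
        targets:
        - 1.3.6.1.2.1.31.1.1.1.6     # ifHCInOctets
        values: ["uplink"]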

xkilian avatar Apr 13 '22 13:04 xkilian

There was/is a lot of resistance to index caching (for longer periods), as the cached data is not the true picture ALL the time. The above seems to be the compromise that was acceptable to the current team. I agree with you that index lookups could be cached for a configurable period, but hey, at least with index filtering it IS now possible to monitor mega-large devices and collect only the relevant data.

xkilian avatar Apr 13 '22 13:04 xkilian

See issue #432

xkilian avatar Apr 13 '22 13:04 xkilian

@xkilian Hi, thanx for your info - looks interesting. But AFAICS #624 just filters out what not to fetch; I can't spot where caching actually happens. Can you give a hint?

jelmd avatar Apr 13 '22 20:04 jelmd

Sorry, no caching. I corrected the post and explained it in a later comment.

xkilian avatar Apr 13 '22 22:04 xkilian

@xkilian Ah, ok, so actually #624 solves nothing wrt. caching. It is kind of orthogonal to the problem in this issue.

OT: I'm not sure whether filtering out OIDs by turning bulkwalks into single gets would help at all - what I've seen so far is that in most cases pulling in the whole table is much faster than pulling single values from a table. But we usually work at layer 2, so tables typically have only 20..100 entries. Furthermore, we use handcrafted generator files, so the generator doesn't pull in unneeded OIDs and does its job as expected.

However, the exporter pulls in e.g. the ifName and entPhysicalName tables on each scrape, which is a little bit brain-damaged (and, as said, slows things down considerably, by a factor of 10+). But to emit proper metrics, snmp_exporter needs them.
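For context, the kind of handcrafted generator module described here looks something like this (standard snmp_exporter generator syntax); the lookups stanza is exactly what forces a fresh walk of the ifName table on every scrape:

modules:
  if_mib:
    walk:
    - ifHCInOctets
    lookups:
    # resolving ifIndex to a human-readable interface name requires
    # walking the whole ifName table on every single scrape - this
    # is the table that caching would avoid re-reading
    - source_indexes: [ifIndex]
      lookup: ifName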

FWIW: The argument against caching is ridiculous: especially with SNMP (and monitoring in general) we always have to deal with outdated and thus most likely incorrect data (does SNMP data have timestamps attached? Schrödinger's cat, etc.). So IMHO the user should decide what is acceptable - not a tool that knows nothing about the device being monitored. Last but not least, if the designer of a tool thinks it might occasionally produce incorrect data, the tool should be enhanced to mitigate the problem. For snmp_exporter this could mean e.g. checking a certain value like sysUptime (which is safe when the vendor documents that the data stays stable between boots), or "simply" enabling it to receive notifications from the device when "something" has changed ;-) .

jelmd avatar Apr 14 '22 09:04 jelmd

BTW: As a workaround, https://github.com/jelmd/snmp_exporter supports static label/value mapping ...

jelmd avatar May 28 '22 04:05 jelmd

What if we implemented a more general-use walk/get cache?

Something like this:

if_mib:
  walk:
  - oid: 1.3.6.1.2.1.31.1.1.1.18 # ifAlias
    ttl: 1h
  - oid: 1.3.6.1.2.1.31.1.1.1.6 # ifHCInOctets
    ttl: 1m

This would allow the exporter to cache the lookup table values, as well as cache actual values for reduced walks when Prometheus is in HA mode.

Of course, this would make things like scaling snmp_exporter harder, because you would get inconsistent caches when snmp_exporter is clustered. (Very bad things like counters jumping backwards would happen.) But we could document these issues.

SuperQ avatar May 28 '22 07:05 SuperQ

I can't say that I love this, but I can see why it is useful in some situations.

It needs a per-OID reset stanza, e.g. on uptime, IMO.

It also needs a way to detect that caching is happening, i.e. an info series or a gauge for the number of cached OIDs and series per scrape.
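One hypothetical way to express such a reset stanza, extending the ttl example above (reset_on is an invented key, not an existing option):

if_mib:
  walk:
  - oid: 1.3.6.1.2.1.31.1.1.1.18 # ifAlias
    ttl: 1h
    # hypothetical: invalidate the cached table whenever sysUpTime
    # goes backwards, i.e. the device has rebooted
    reset_on: 1.3.6.1.2.1.1.3.0 # sysUpTime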


RichiH avatar May 28 '22 10:05 RichiH

The ttl property looks good. However, as a workaround I think I would set up a 2nd scrape target and just query the static tables at bigger intervals; see the sketch below. The real win is if the index lookups get cached. So wrt. the generator config it should be optional for walk items as well as for lookups, which would finally end up in the exporter config as walk/get item options, yepp. I can't say anything about HA mode: we just pull once and push the data to independent DBs as needed.
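As a sketch of that workaround, two Prometheus scrape jobs could query the same device via the same exporter at different intervals (the module names if_mib_counters and if_mib_names are assumptions; the relabeling is the usual snmp_exporter pattern):

scrape_configs:
- job_name: snmp_counters          # fast-changing values
  scrape_interval: 15s
  metrics_path: /snmp
  params:
    module: [if_mib_counters]      # hypothetical module without lookup tables
  static_configs:
  - targets: ['192.0.2.1']         # the SNMP device itself
  relabel_configs: &snmp_relabel
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: snmp-exporter:9116 # address of the snmp_exporter
- job_name: snmp_names             # slow, mostly static lookup tables
  scrape_interval: 10m
  metrics_path: /snmp
  params:
    module: [if_mib_names]         # hypothetical module with ifName etc.
  static_configs:
  - targets: ['192.0.2.1']
  relabel_configs: *snmp_relabel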

jelmd avatar Jun 21 '22 22:06 jelmd