Feature request: counters/telemetry on telemetry

Open johnhtodd opened this issue 2 years ago • 0 comments

  • Program: Authoritative, Recursor, dnsdist
  • Issue type: Feature request

Short description

Create logging metrics that show status and counters on telemetry consumers

Usecase

We send a large number of dnstap messages to nearby consumers. We don't know when those consumers are overwhelmed or, if they are overwhelmed, how many messages we've lost. Also, sometimes we see flapping behavior due to restarts or faults on the consumer side. It would be useful to know what is happening on logging consumers from the perspective of the logging origin, and this would apply to all packages: rec, auth, and dnsdist (though we are most interested in dnsdist). If there were some way to subtract successfully transmitted messages from possible transmitted messages, then we could see when we are hitting faults where the socket was open but the consumer was not able to consume fast enough. If we also knew the status of the socket, then we could tell the difference between times when that delta was increasing due to simple network failure and times when it was increasing due to inability to consume the data fast enough.

Description

Counters that collect and make available the following information for each IP:port consumer (maybe make those tags in the Prometheus format of the output in /metrics?)

  • number of messages sent to each telemetry consumer. This counter would increment even if the consumer was unavailable, not connected, not able to ACK previous messages fast enough, etc. - it is a counter of "possible" messages that the consumer could have received.
  • number of messages actually transmitted to the consumer. This would count how many messages were successfully (at least, from the TCP definition of "success") transmitted to the consumer. If the consumer is not fast enough to accept messages, and messages are dropped, those dropped messages would not increment this counter.
  • number of times the telemetry socket has been opened. Every time the session achieves a successful three-way handshake, this would increment.
  • current status of TCP session - 0 for not fully established, 1 for fully established (I don't think it's necessary to track half-opened sockets here, as that seems to be extra work that is not really helpful)
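
To make the request concrete, here is a minimal sketch of the four values above kept per consumer and rendered in the Prometheus text format, with the IP:port as a label. Everything in it is an assumption for illustration: the metric names, the `consumer` label, and the `ConsumerStats`, `render_metrics` and `diagnose` names are hypothetical and do not exist in PowerDNS. The `diagnose` helper shows the comparison described in the use case: a growing gap between attempted and transmitted messages is attributed to the connection when the socket is down, and to a slow consumer when it is up.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConsumerStats:
    """Hypothetical per-consumer counters, one field per item in the list above."""
    attempted: int = 0    # messages the consumer could have received
    transmitted: int = 0  # messages successfully handed to TCP
    connections: int = 0  # completed three-way handshakes
    established: int = 0  # 1 if the session is fully established, else 0

    def dropped(self) -> int:
        # The "delta" from the use case: possible minus successful messages.
        return self.attempted - self.transmitted


def render_metrics(stats: Dict[str, ConsumerStats]) -> str:
    """Render the counters as Prometheus text lines (metric names are made up)."""
    lines = []
    for addr, s in sorted(stats.items()):
        label = f'{{consumer="{addr}"}}'
        lines.append(f"telemetry_messages_attempted_total{label} {s.attempted}")
        lines.append(f"telemetry_messages_transmitted_total{label} {s.transmitted}")
        lines.append(f"telemetry_connections_total{label} {s.connections}")
        lines.append(f"telemetry_connection_established{label} {s.established}")
    return "\n".join(lines) + "\n"


def diagnose(prev: ConsumerStats, cur: ConsumerStats) -> str:
    """Compare two samples of one consumer and guess why messages were lost."""
    if cur.dropped() <= prev.dropped():
        return "ok"
    # Socket state at sample time decides between the two failure modes.
    return "consumer too slow" if cur.established else "connection down"
```

The socket status is only sampled when the metrics are read, so a short flap between two scrapes would show up in the connection counter but not in the status value.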

https://drive.google.com/file/d/1YgplffRgtaiksqi-Gg2Gxw0gTR9XkDGs/view?usp=sharing

johnhtodd avatar Aug 09 '22 17:08 johnhtodd

I'd like to tackle this. We have two ways of sending out telemetry: our own (proprietary) RemoteLogger and a class used to send dnstap messages via libfstrm: FrameStreamLogger. Both classes already keep some metrics; they are just not exposed and not uniform. It would be best to make that collection uniform and provide a common interface to retrieve the metrics. That interface can then be used by rec or dnsdist. The number of connections made would be hard to generalise: libfstrm does not expose any internal API for that. Its objective is to hide all the networking details from the user of the library.

But messages sent and dropped (for various reasons) should be easy.
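
As an illustration only (this is not PowerDNS code), a common interface of that kind could look roughly like the sketch below. The two classes are stand-ins, not the real RemoteLogger or FrameStreamLogger, and every name and field in it is hypothetical. It assumes each logger tracks its own internal numbers differently and maps them onto the same pair of values, sent and dropped; connection counts are left out because libfstrm does not expose them.

```python
from abc import ABC, abstractmethod
from typing import Iterable, NamedTuple


class LoggerStats(NamedTuple):
    """Uniform result shape shared by all telemetry loggers (hypothetical)."""
    sent: int
    dropped: int


class StatsSource(ABC):
    """Common interface a daemon could poll without knowing the logger type."""

    @abstractmethod
    def get_stats(self) -> LoggerStats:
        ...


class QueueingLoggerStandIn(StatsSource):
    """Stand-in for a logger that counts queued messages and failed writes."""

    def __init__(self) -> None:
        self.queued = 0
        self.write_failures = 0

    def get_stats(self) -> LoggerStats:
        return LoggerStats(self.queued - self.write_failures, self.write_failures)


class FrameLoggerStandIn(StatsSource):
    """Stand-in for a logger that only sees accept/reject per submitted frame."""

    def __init__(self) -> None:
        self.accepted = 0
        self.rejected = 0

    def get_stats(self) -> LoggerStats:
        return LoggerStats(self.accepted, self.rejected)


def total(sources: Iterable[StatsSource]) -> LoggerStats:
    """Sum the uniform stats across all configured loggers."""
    stats = [s.get_stats() for s in sources]
    return LoggerStats(sum(s.sent for s in stats), sum(s.dropped for s in stats))
```

With this shape, the caller only depends on `get_stats()`, so the metrics code would not need to change if a third logger type were added.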

omoerbeek avatar Aug 29 '22 08:08 omoerbeek

That's probably a good start. Making the two have uniform reporting characteristics would be useful, and if the number of attempted connections is difficult, then at least do what is possible.

You mention "rec or dnsdist" - is this also useful for auth? My particular interest does not extend to auth, but consistency is always good.

johnhtodd avatar Aug 29 '22 19:08 johnhtodd