Add additional Prometheus logging data for DNSTAP TCP sources
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Use Cases
With the addition of TCP sources in DNSTAP, it becomes possible for a DNSTAP source to have many inputs, using the same socket.
It is difficult to determine if there are failures or unusual conditions on the ingest side of this socket if there is no method by which to distinguish any of these different TCP sources from each other in the Prometheus output of the metrics scrapes. If a DNSTAP data stream is non-operational entirely from launch, or stops working during normal operation, it is not possible to determine that programmatically except by guessing.
For instance, this is what appears in the metrics when I do this:
curl -s http://127.0.0.1:9598/metrics | grep main_dnstap_1|grep tcp
vector_component_received_bytes_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",protocol="tcp"} 766489648 1710026543050
vector_component_received_event_bytes_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",mode="tcp"} 4289706405 1710026543050
vector_component_received_events_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",mode="tcp"} 3429329 1710026543050
I have two DNSTAP sources (different IP addresses; different DNS systems) sending data to main_dnstap_1's TCP socket. But it is not possible now for me to distinguish between those two sources - all of the data is lumped into a single set of tags.
Attempted Solutions
Workarounds: I could create a separate source for every possible origin, each bound to a different port number on the Vector server. This is a brute-force hack that does not scale across hundreds of locations: a map of which origin uses which port would have to be maintained, and (based on experience) it would eventually break in a difficult-to-diagnose way. That is a kludge we're hoping to avoid with Vector. Consistency in configuration is ideal.
Proposal
It would be ideal to add an IP-origin-specific tag to the metrics produced by events delivered over a TCP socket that can have multiple origins. I would imagine something like this:
vector_component_received_bytes_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",protocol="tcp",ip_origin="10.10.1.13"} 766489648 1710026543050
vector_component_received_bytes_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",protocol="tcp",ip_origin="10.10.1.50"} 3384922 1710026543050
vector_component_received_event_bytes_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",mode="tcp",ip_origin="10.10.1.13"} 4289706405 1710026543050
vector_component_received_event_bytes_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",mode="tcp",ip_origin="10.10.1.50"} 843272 1710026543050
vector_component_received_events_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",mode="tcp",ip_origin="10.10.1.13"} 3429329 1710026543050
vector_component_received_events_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",mode="tcp",ip_origin="10.10.1.50"} 123774 1710026543050
For brevity, I only provided one example metric set, but anywhere in the DNSTAP source that could reference a discrete TCP session with different counters may need to be split out into distinct metric lines with unique sets of tags.
From a Prometheus perspective, this is entirely normal and natural and does not create a problem with the cardinality of tags. Aggregations are almost always written with "by" or "without" clauses (and vector matching with "on" or "ignoring"), which can easily include or exclude tag names in the returned data.
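As a rough illustration of the aggregation point above, the following Python sketch parses four of the proposed metric lines and emulates what `sum without (ip_origin)` and `sum by (ip_origin)` would return in PromQL. The `ip_origin` label is the hypothetical tag from this proposal, and the helper names (`parse_line`, `sum_without`, `sum_by`) are made up for the example; the parser is deliberately minimal and only handles lines shaped like the ones shown here.

```python
import re
from collections import defaultdict

# Lines copied from the proposed output; ip_origin is hypothetical.
SAMPLE = """\
vector_component_received_bytes_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",protocol="tcp",ip_origin="10.10.1.13"} 766489648 1710026543050
vector_component_received_bytes_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",protocol="tcp",ip_origin="10.10.1.50"} 3384922 1710026543050
vector_component_received_events_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",mode="tcp",ip_origin="10.10.1.13"} 3429329 1710026543050
vector_component_received_events_total{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab",mode="tcp",ip_origin="10.10.1.50"} 123774 1710026543050
"""

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_line(line):
    """Return (metric_name, labels_dict, value), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    labels = dict(LABEL_RE.findall(m.group("labels")))
    return m.group("name"), labels, float(m.group("value"))


def sum_without(samples, drop):
    """Emulate `sum without (<drop>)`: sum series that differ only in dropped labels."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        totals[(name, kept)] += value
    return dict(totals)


def sum_by(samples, keep):
    """Emulate `sum by (<keep>)`: group only on the listed labels."""
    totals = defaultdict(float)
    for name, labels, value in samples:
        key = tuple((k, labels.get(k, "")) for k in keep)
        totals[(name, key)] += value
    return dict(totals)


samples = [s for s in (parse_line(l) for l in SAMPLE.splitlines()) if s is not None]

# Dropping ip_origin recovers one series per metric, as in today's output.
per_component = {name: v for (name, _), v in sum_without(samples, {"ip_origin"}).items()}

# Grouping by ip_origin gives the per-producer view this issue asks for.
per_origin = {(name, key[0][1]): v for (name, key), v in sum_by(samples, ["ip_origin"]).items()}
```

Here `per_component` collapses back to a single total per metric, while `per_origin` keeps one value per producer address, so existing dashboards that aggregate the label away would be unaffected by the extra tag.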
However... this may break assumptions about how sources are currently understood and instrumented, and I don't understand well enough how the maintainers wish to see metrics implemented. If adding new series with extra tag cardinality is not the best way to do it, then perhaps a new set of metrics specific to the DNSTAP source could be created whenever there are "mode: tcp" entries in the table. This is sort of what the HTTP data seems to do, I think.
Regardless of the model, I was hoping to be able to monitor Vector such that I could collect/alert on this data in some form:
- for each ip_origin connection endpoint: (new entries created at first successful connection by an origin, and would remain until Vector is restarted)
- number of total TCP connections successfully established to Vector
- number of seconds since last successful connection (0 if never)
- status of TCP connection (1 = connected, 0 = not connected)
- number of events received from this endpoint
- number of bytes received from this endpoint
One flaw in my demonstrated model is that if two streams of data come from the same ip_origin, they would be lumped together. I see no easy way around this, since including the port number would create unnecessary cardinality (each new connection by any of the instances would use a new origin-side port number, which would be confusing and inconsistent).
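To make the requested per-origin data concrete, here is a minimal Python sketch of the bookkeeping it implies. It is only an illustration of the model described above, not how Vector is or would be implemented; `OriginStats`, `OriginTracker`, and every method and field name are hypothetical. State is keyed by IP address alone, so the ephemeral source port is ignored and two streams from the same address are merged, as noted above.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Peer = Tuple[str, int]  # (ip, source_port) as seen on the accepted TCP socket


@dataclass
class OriginStats:
    connections_total: int = 0      # successful connections since start
    open_connections: int = 0       # currently open connections from this IP
    last_connect_time: Optional[float] = None
    events_total: int = 0
    bytes_total: int = 0


class OriginTracker:
    """Hypothetical per-origin counters, keyed by IP only (port is discarded)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._stats: Dict[str, OriginStats] = {}

    def on_connect(self, peer: Peer) -> None:
        ip, _port = peer
        stats = self._stats.setdefault(ip, OriginStats())  # entry created at first connect
        stats.connections_total += 1
        stats.open_connections += 1
        stats.last_connect_time = self._clock()

    def on_disconnect(self, peer: Peer) -> None:
        ip, _port = peer
        stats = self._stats.get(ip)
        if stats is not None and stats.open_connections > 0:
            stats.open_connections -= 1

    def on_events(self, peer: Peer, count: int, nbytes: int) -> None:
        ip, _port = peer
        stats = self._stats.setdefault(ip, OriginStats())
        stats.events_total += count
        stats.bytes_total += nbytes

    def snapshot(self, ip: str) -> Dict[str, float]:
        """Values that would be exported for one origin; entries persist until restart."""
        stats = self._stats[ip]
        if stats.last_connect_time is None:
            since = 0.0  # "0 if never", per the list above
        else:
            since = self._clock() - stats.last_connect_time
        return {
            "connections_total": stats.connections_total,
            "seconds_since_last_connection": since,
            "connected": 1 if stats.open_connections > 0 else 0,
            "events_total": stats.events_total,
            "bytes_total": stats.bytes_total,
        }
```

Tracking a count of open connections rather than a single boolean means the "connected" value stays at 1 until the last stream from a given address closes, which is one way to handle the merged-streams case.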
POSSIBLY RELATED, POSSIBLY UNRELATED: I also see these metrics:
# HELP vector_open_connections open_connections
# TYPE vector_open_connections gauge
vector_open_connections{component_id="main_dnstap_1",component_kind="source",component_type="dnstap",host="dev01.lab"} 0 1710025555050
...and those are incorrect. I have two connections open to dnstap on the TCP socket that is connected to the source named main_dnstap_1. I am perhaps misunderstanding what this metric indicates.
References
No response
Version
vector 0.37.0 (x86_64-unknown-linux-gnu 3a495e3 2024-03-08 04:01:44.382953501)
Thanks @johnhtodd. I can see the usefulness of those metrics and of the ip_origin tag (I think we have some precedent for calling this source_ip in other places) for tracking the input streams separately. I think I'd like to see the ip_origin tag be opt-in, though, since it can potentially have a high cardinality. I'd be ok with seeing that added either to just the dnstap source or to the socket source more generally.
source_ip as the name would be just fine to match precedent. I think having it as an optional cardinality expansion flag is also fine, and I would think that associating it with the socket concept would be the better way to do it, so that it can be applied to any other source.
In theory, it could also be applied to sinks, since it is possible to have sinks with wildcards. Even for a single sink destination with no wildcarding, the operator would probably benefit from understanding how often the TCP sessions are being reset, or whether a socket is open or not. This may be too ambitious for this iteration of the concept.
I'm sifting through older feature requests, and this one stood out to me today. Having this data (stats per DNSTAP IP origin) would be useful for finding misbehaving producers. If this were implemented generically for any source, it would probably be quite useful overall, letting Vector become a validation point for upstream data in a way that isn't hidden "inside" the data itself.