Feature request: add sampling rate to downsampled items

Open johnhtodd opened this issue 1 year ago • 6 comments

The downsample method in the transformer is quite useful, but creates problems for upstream data modeling when expansion is desired. Can there be a downsample rate added to each item, and possibly revealed in the various metrics that may have downsampled data contained within them?

There are instances where it would be useful to understand what the computed rate of samples would have been if the downsample rate had been expanded. This is easier to show via example than to explain:

If we are sampling 1 out of every 20 packets and sending them on to an upstream logger (dnstap, Kafka, or whatever), then when we perform a computation on that result at a later time, we will need to multiply every query by 20 to get an estimate of the actual number of events the sample represents. But where did we get the number '20' in our downstream system? Somehow, we need the ability to automatically multiply each of our downsampled elements by 20 to get realistic numbers.
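
To make the multiplication concrete, here is a minimal Go sketch (not part of go-dnscollector; the numbers and names are purely illustrative) of the expansion a downstream consumer would have to perform, assuming it somehow knew the rate was 20:

package main

import "fmt"

func main() {
    const sampleRate = 20 // 1 out of every 20 packets was kept at the edge
    observed := 1234      // downsampled events actually received by the upstream logger

    // Each kept event stands in for `sampleRate` real events.
    estimated := observed * sampleRate
    fmt.Printf("observed=%d rate=1:%d estimated=%d\n", observed, sampleRate, estimated)
}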

Of course, expansion like this is always an estimate, and may have statistical problems, but that is out of scope for this discussion & feature request - that is a larger issue created with any statistical sample. What is missing in go-dnscollector's model is some way to transmit the rate at which the sample was taken, so that in the future we can perform this expansion for this specific sample. This is especially concerning when there may be many different sampling rates for different items in the network - some resolver/go-dnscollector instances in our network may be sampling at 1:20, but other cities at 1:10, or some at 1:1. Perhaps we also have specific domains for which we want 100% of the data. If we cannot connect the sample rate to the specific sample itself, then it is impossible to compare any two systems with potentially different sample rates to each other at a later time without somehow understanding (through out-of-band knowledge) the sample rate for that particular location and/or that particular rule, and then doing very complex lookup and expansion math to represent the data.

There already exists the concept of a traffic reducer, where a number of repeating DNS objects can be summarized into a single sample with an expansion counter. I might suggest that this problem of sample expansion be solved in the same way - a "sample rate" identifier could be inserted into the object, identifying it as something taken as a sample, so that an expansion can be performed at the time of examination to get a result that is imprecise but at least approximates what was lost in the downsampling process. Like this, perhaps, for a 5% sample rate:

{ "sample-rate": { "rate": 20,} }

Another notable concept here (that is already a problem with the current model?) is that there may be different sample-rate results on the same server - this doesn't just apply to the "transforms: -> filtering: -> downsample:" setting. If a particular domain (keep-domain-file), RCODE, or other criterion is used in the traffic filtering configuration, then we probably want to log those matches at 1:1. But perhaps everything else that doesn't match those specific items of interest gets logged at a 1:20 rate. So each filter would need to have a different rate applied to it, and that rate would then be communicated to the onward loggers or other tooling (& metrics?) collecting that information, so the correct expansion could be applied to truly understand the exact or estimated volume at which each event class is occurring. The sample-rate would need to be attached to EVERY object that is sent onwards. A safe method for this might be that if any downsampling occurs anywhere in the configuration, then every reported object from that dnscollector instance has a "sample-rate" attached to it.

This would also imply that any filtering rules with a high sample rate would be parsed first, and any lower sample-rate rules would then be skipped; otherwise counts would happen twice, and any 1:1 sample items could potentially skew the numbers for lower sample-rate filter classes. This puts the burden of understanding the math on the person writing the queries against the telemetry platform later, but I can see no way around that which is less dangerous or confusing.
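
A minimal sketch of the query-side bookkeeping this implies (the class names, counts, and rates are invented for illustration): each object carries the rate of the single rule that emitted it, and the consumer expands each class by its own rate before comparing or summing them.

package main

import "fmt"

func main() {
    // Observed (downsampled) counts per filter class, together with the
    // rate that was attached to the objects of that class. An object must
    // only ever be counted by the one rule that matched it first.
    observed := map[string]struct {
        Count int
        Rate  int
    }{
        "domains-of-interest": {Count: 5000, Rate: 1}, // logged 1:1
        "everything-else":     {Count: 800, Rate: 20}, // logged 1:20
    }

    total := 0
    for class, o := range observed {
        estimated := o.Count * o.Rate
        fmt.Printf("%s: observed=%d rate=1:%d estimated=%d\n", class, o.Count, o.Rate, estimated)
        total += estimated
    }
    fmt.Println("estimated total:", total)
}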

For Prometheus and Influx results, this would probably also require a separate metric that shows the sample rate for each rule (per metric? that's more difficult) that creates a result (downsampled or not) - does it do this today? Or do metrics get populated before downsampling?

This is a complex feature request, but it is fundamentally tied to the ability of go-dnscollector to scale to larger deployments without scaling linearly with the bandwidth or CPU of the resolvers themselves. If we can ditch 95% of the telemetry traffic right at the edge of the network, where it comes into the go-dnscollector agents, then our resource footprint for telemetry drops drastically. At some sites we want 5% sample rates, at some sites we want 20% sample rates (because of lower real volume), but for some domains we want 100% sample rates across all sites. Mixing those types of rate differences is really, really hard (impossible) without understanding, at the individual DNS object level, what the sample rate was when that object was collected. Doing this at a macro level (just looking at the sample rate as it applies to "transforms: -> filtering: -> downsample:" in the traffic filtering configuration section) would be useful, but seems insufficient. Being a bit more thorough with this model would allow some very interesting and flexible sample-rate inclusions to be applied to different rulesets, allowing easier computation of estimated traffic volumes deeper in the telemetry pipeline.

johnhtodd avatar Sep 29 '23 20:09 johnhtodd

Hi John

For the first point, adding the capability to transmit the rate should be relatively easy. As you said, the model needs to be updated, but it's a minor change. It can be added in the next release.

Regarding the second point, I need to run some tests, but to accomplish what you described, it's possible to duplicate the incoming flow and apply multiple transforms to it (just through the configuration).

Something like this:

                    one collect 
                        |
                        v
    duplicate DNS objects to all upstream loggers through the multiplexer
     |                                         |     
     v                                         v
 apply transforms:                           apply transforms: 
 -> filtering (keep-domain-file)             -> filtering (keep-domain-file)
 -> downsample (1:20):                       -> downsample (1:10)
     |                                         |
     v                                         v
 upstream                                   upstream
 logger                                     logger

Is my understanding correct?

I don't know if you are using containers, but in Docker mode it's easy to chain multiple DNS-collector instances on the same machine, with a dedicated config (and transforms) in each container.

dmachard avatar Sep 30 '23 08:09 dmachard

I think I was interested in how transformations can be chained (rather than simply forked with a complete set of transformations, as you show), but I am starting to think that requires the concepts of "rules" and boolean logic that forwards, duplicates, or stops a packet from moving to different multiplexers. I think this concept needs more thought on my part.

johnhtodd avatar Nov 20 '23 04:11 johnhtodd

However, I'm still very interested in the sample rate being added as a tag. Of course, if my other ticket (arbitrary tags - https://github.com/dmachard/go-dnscollector/issues/471) is done, I could do this "by hand", but the sample rate seems inherently required for anything that does sampling, so automatically adding it as a function of the "downsample" action seems like a good idea.

johnhtodd avatar Nov 20 '23 04:11 johnhtodd

I've missed a really obvious thing in my request as well, one that is necessary to make sense of downsampled objects: the timestamps of first seen and last seen (though last seen can be implied from the timestamp of the object's transmission).

Let's use the example in the docs: "only keep 1 out of every downsample records, e.g. if set to 20, then this will return every 20th record, dropping 95% of queries." Let's say that the match criteria are very rarely hit. A downsample may then receive a very small number of entries, and the queue will take a very long time to accumulate 20 entries, at which point it will emit a single record. So we would ideally want to know when the first matching event in the downsample bucket happened, and the last time (which would be the current time) that a matching event was seen, so that some sort of calculation could be performed downstream on how many events over time were actually happening.

My understanding: the only difference between the reducer and the downsampler is that with downsampling, the full packet of the last matching event is transmitted - all fields, including the sample rate and first-seen timestamp - while with the reducer only the unique fields (and the count) are transmitted at the end of the watching interval. Also, the downsampler sends events based on the number of matching events, not on time; the reducer aggregates and sends events based on time windows. Therefore, the downsampler needs to include the first-seen time when the final matching trigger causes a transmission; otherwise it is impossible to reconstruct the duration of the sampled set. In other words, it needs to be possible to answer this question in our example: "How long did it take you to get 20 samples before you sent me this last one?"
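
A minimal sketch of the downstream calculation this would enable, assuming the emitted record carried both the rate and a first-seen timestamp (both hypothetical fields, not an existing go-dnscollector schema):

package main

import (
    "fmt"
    "time"
)

func main() {
    const downsample = 20 // one emitted record stands for 20 matching events

    // Hypothetical fields on the emitted record: when the first event in
    // this bucket was seen, and when the record itself was emitted.
    firstSeen := time.Date(2023, 11, 20, 10, 0, 0, 0, time.UTC)
    emittedAt := time.Date(2023, 11, 20, 10, 5, 0, 0, time.UTC)

    window := emittedAt.Sub(firstSeen)
    eventsPerSecond := float64(downsample) / window.Seconds()
    fmt.Printf("%d events over %s = roughly %.3f events/sec\n", downsample, window, eventsPerSecond)
}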

johnhtodd avatar Nov 20 '23 22:11 johnhtodd

Hi @johnhtodd

As a first step, I made an initial attempt with PR #480; more will come to cover this feature request.

To add more flexibility with transformers, my proposal is to include a new section (like routes) dedicated to describing the transformers to be applied:

transformers:
  - apply-to: [ tap ]
    actions: 
    - filtering:
        keep-domain-file: /etc/dnscollector/listA.txt
        downsample: 20
    - filtering:
        keep-domain-file: /etc/dnscollector/listB.txt
        downsample: 10

I need to check more, but perhaps it's not complicated to implement.

Denis

dmachard avatar Nov 23 '23 07:11 dmachard

Does the first filter action remove domains from the processing queue before they are considered by the second filter?

This gets to one of the points I think I was not clearly raising, but which is also implicit in some of my other questions: is there some more generalized model that could be built as options within the transformations, with logic that allows for a forking branch (to a named next transformer/logger), one or more duplication branches which may fork or continue, a halt, or a continuation through the existing flow? If this type of method doesn't end up being possible, then I can imagine lots of wasted cycles doing the same comparisons many times. This is not really related to the content of my original feature request, which was just having sample rates included in any output from a "downsample:" filter. :-)

johnhtodd avatar Nov 23 '23 18:11 johnhtodd