opentelemetry-collector-contrib icon indicating copy to clipboard operation
opentelemetry-collector-contrib copied to clipboard

Refactor the probabilistic sampler processor; add FailClosed configuration, prepare for OTEP 235 support

Open jmacd opened this issue 1 year ago • 3 comments

Description:

Refactors the probabilistic sampling processor to prepare it for more OTEP 235 support.

This clarifies existing inconsistencies between tracing and logging samplers, see the updated README. The tracing priority mechanism applies a 0% or 100% sampling override (e.g., "1" implies 100% sampling), whereas the logging sampling priority mechanism supports variable-probability override (e.g., "1" implies 1% sampling).

This pins down cases where no randomness is available, and organizes the code to improve readability. A new type called randomnessNamer carries the randomness information (from the sampling pacakge) and a name of the policy that derived it. When sampling priority causes the effective sampling probability to change, the value "sampling.priority" replaces the source of randomness, which is currently limited to "trace_id_hash" or the name of the randomess-source attribute, for logs.

While working on #31894, I discovered that some inputs fall through to the hash function with zero bytes of input randomness. The hash function, computed on an empty input (for logs) or on 16 bytes of zeros (which OTel calls an invalid trace ID), would produce a fixed random value. So, for example, when logs are sampled and there is no TraceID and there is no randomness attribute value, the result will be sampled at approximately 82.9% and above.

In the refactored code, an error is returned when there is no input randomness. A new boolean configuration field determines the outcome when there is an error extracting randomness from an item of telemetry. By default, items of telemetry with errors will not pass through the sampler. When FailClosed is set to false, items of telemetry with errors will pass through the sampler.

The original hash function, which uses 14 bits of information, is structured as an "acceptance threshold", ultimately the test for sampling translated into a positive decision when Randomness < AcceptThreshold. In the OTEP 235 scheme, thresholds are rejection thresholds--this PR modifies the original 14-bit accept threshold into a 56-bit reject threshold, using Threshold and Randomness types from the sampling package. Reframed in this way, in the subsequent PR (i.e., #31894) the effective sampling probability will be seamlessly conveyed using OTEP 235 semantic conventions.

Note, both traces and logs processors are now reduced to a function like this:

				return commonSamplingLogic(
					ctx,
					l,
					lsp.sampler,
					lsp.failClosed,
					lsp.sampler.randomnessFromLogRecord,
					lsp.priorityFunc,
					"logs sampler",
					lsp.logger,
				)

which is a generic function that handles the common logic on a per-item basis and ends in a single metric event. This structure makes it clear how traces and logs are processed differently and have different prioritization schemes, currently. This structure also makes it easy to introduce new sampler modes, as shown in #31894. After this and #31940 merge, the changes in #31894 will be relatively simple to review as the third part in a series.

Link to tracking Issue:

Depends on #31940. Part of #31918.

Testing: Added. Existing tests already cover the exact random behavior of the current hashing mechanism. Even more testing will be introduced with the last step of this series. Note that https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/32360 is added ahead of this test to ensure refactoring does not change results.

Documentation: Added.

jmacd avatar Mar 25 '24 23:03 jmacd

Reviewers: this is a draft until https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/31940 merges.

jmacd avatar Mar 25 '24 23:03 jmacd

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Apr 12 '24 05:04 github-actions[bot]

~make genotelcontribcol is un-doing the effect of make crosslink and I don't know what to do next.~ (EDIT: the builder configuration has to list replace statements, crosslink won't stick.)

jmacd avatar Apr 16 '24 22:04 jmacd