opentelemetry-rust icon indicating copy to clipboard operation
opentelemetry-rust copied to clipboard

Sampling Trouble

Open kevincox opened this issue 3 years ago • 4 comments

I was trying to use the opentelemetry::sdk::trace::ShouldSample API and am having trouble. I figured that I would at least write it down here in case others have the same issue or this could be used to improve the library. This is likely just a documentation problem, but I really struggled.

My goal was to export traces to Honeycomb with sampling. I managed to do something simple like this:

#[derive(Debug)]
struct Sampler;

impl opentelemetry::sdk::trace::ShouldSample for Sampler {
	fn should_sample(&self,
		parent_context: Option<&opentelemetry::Context>,
		trace_id: opentelemetry::trace::TraceId,
		_name: &str,
		_span_kind: &opentelemetry::trace::SpanKind,
		attributes: &[opentelemetry::KeyValue],
		_links: &[opentelemetry::trace::Link],
		_instrumentation_library: &opentelemetry::sdk::InstrumentationLibrary,
	) -> opentelemetry::sdk::trace::SamplingResult {
		let sample_rate: u128 = attributes.iter()
			.find(|kv| kv.key == opentelemetry::Key::from_static_str("sampleRate"))
			.map(|kv| match kv.value {
				opentelemetry::Value::I64(rate) => rate.try_into().unwrap(),
				ref value => panic!("Unexpected value {:?} for sample rate", value),
			})
			.unwrap_or(1)
			.max(1);

		opentelemetry::sdk::trace::SamplingResult {
			decision: if u128::from_le_bytes(trace_id.to_bytes()) % sample_rate == 0 {
				opentelemetry::sdk::trace::SamplingDecision::RecordAndSample
			} else {
				opentelemetry::sdk::trace::SamplingDecision::Drop
			},
			attributes: Vec::new(),
			trace_state: match parent_context {
				Some(ctx) => opentelemetry::trace::TraceContextExt::span(ctx).span_context().trace_state().clone(),
				None => opentelemetry::trace::TraceState::default(),
			},
		}
	}
}

I'm sure the code could be slightly better, for example there is definitely some bias on the sampling depending on the sampleRate used. However this had a lot of problems:

  1. The docs are super unclear. I mostly copied opentelemetry::sdk::trace::Sampler but it isn't even clear to me what most of the code does.
  2. This only sets the sampleRate attribute on the root span. Child spans and events don't have the value set which means that they are weighted incorrectly in graphs.
  3. This only appears to run on the root span. This means that I can't do different sampling at different levels. (For example children that are sampled less frequently than the root or children that may be traced even if the root isn't)
    • This is quite weird because opentelemetry::sdk::trace::Sampler seems to think that it can base sampling decisions off of the parent span, but my Sampler is never called with anything but the root.

It would be nice if something like this was easy and documented.

kevincox avatar May 18 '22 00:05 kevincox

Thanks for the feedback. I think the sampling method in the start span only called for span without a parent. For span with a parent, it will respect the sampling result of the parent spans.

Alternatively, you can set the sampling result for individual spans using SpanBuilder::sampling_result.

I can see a benefit to changing the behavior to run sampler for every span but it comes with a performance overhead. Gonna need to investigate it a little further

TommyCpp avatar May 18 '22 03:05 TommyCpp

I believer this may be both a documentation issue AND a bug. if you look at span creation in the Java SDK There is only reliance on the sampling decision. Whereas in the Rust SDK we're doing what almost appears as ParentBasedSampling before sampling is even considered.

Additionally when looking at the Sampler We should probably actually change the ParentBased(Box<Sampler>) to ParentBased(Box<dyn ShouldSample>).

I realize that both of these will have performance implications. Before I got on and put the work I figure I would comment first.

hdost avatar Jun 05 '22 10:06 hdost

@hdost yeah I think we should sampling at every span creation. Do you want to take this issue as you are already working on it?

TommyCpp avatar Jul 04 '22 21:07 TommyCpp

Also currently working on this in #839 , I believe it should be patched but I'm looking into how best to handle the tracing-opentelemetry crate.

hdost avatar Jul 17 '22 09:07 hdost

@kevincox we've added some further docs and updated the implementation as of 0.18.0 Going to consider this closed. If there's something missing you can comment.

hdost avatar Sep 22 '22 05:09 hdost

Do you have a link to these docs? I think I will also need to open another ticket for propagating sample rate information but I can do that in a more targeted ticket?

To summarize the status of the original points:

  1. I couldn't find more docs, but look forward to a link.
  2. Still an issue, don't know how to propagate down the tree. I will open a follow-up ticket/feature request.
  3. This seems to be fixed.

kevincox avatar Sep 23 '22 16:09 kevincox

For 1. I added docs https://github.com/open-telemetry/opentelemetry-rust/pull/807

For 2. https://github.com/open-telemetry/opentelemetry-rust/pull/839 this should have handled it.

hdost avatar Sep 27 '22 04:09 hdost

I don't think 2 is resolved. I have opened a new issue focused on it: https://github.com/open-telemetry/opentelemetry-rust/issues/886

kevincox avatar Sep 27 '22 19:09 kevincox