Sampling Trouble
I was trying to use the opentelemetry::sdk::trace::ShouldSample API and am having trouble. I figured that I would at least write it down here in case others have the same issue or this could be used to improve the library. This is likely just a documentation problem, but I really struggled.
My goal was to export traces to Honeycomb with sampling. I managed to do something simple like this:
#[derive(Debug)]
struct Sampler;
impl opentelemetry::sdk::trace::ShouldSample for Sampler {
fn should_sample(&self,
parent_context: Option<&opentelemetry::Context>,
trace_id: opentelemetry::trace::TraceId,
_name: &str,
_span_kind: &opentelemetry::trace::SpanKind,
attributes: &[opentelemetry::KeyValue],
_links: &[opentelemetry::trace::Link],
_instrumentation_library: &opentelemetry::sdk::InstrumentationLibrary,
) -> opentelemetry::sdk::trace::SamplingResult {
let sample_rate: u128 = attributes.iter()
.find(|kv| kv.key == opentelemetry::Key::from_static_str("sampleRate"))
.map(|kv| match kv.value {
opentelemetry::Value::I64(rate) => rate.try_into().unwrap(),
ref value => panic!("Unexpected value {:?} for sample rate", value),
})
.unwrap_or(1)
.max(1);
opentelemetry::sdk::trace::SamplingResult {
decision: if u128::from_le_bytes(trace_id.to_bytes()) % sample_rate == 0 {
opentelemetry::sdk::trace::SamplingDecision::RecordAndSample
} else {
opentelemetry::sdk::trace::SamplingDecision::Drop
},
attributes: Vec::new(),
trace_state: match parent_context {
Some(ctx) => opentelemetry::trace::TraceContextExt::span(ctx).span_context().trace_state().clone(),
None => opentelemetry::trace::TraceState::default(),
},
}
}
}
I'm sure the code could be slightly better, for example there is definitely some bias on the sampling depending on the sampleRate used. However this had a lot of problems:
- The docs are super unclear. I mostly copied
opentelemetry::sdk::trace::Samplerbut it isn't even clear to me what most of the code does. - This only sets the
sampleRateattribute on the root span. Child spans and events don't have the value set which means that they are weighted incorrectly in graphs. - This only appears to run on the root span. This means that I can't do different sampling at different levels. (For example children that are sampled less frequently than the root or children that may be traced even if the root isn't)
- This is quite weird because
opentelemetry::sdk::trace::Samplerseems to think that it can base sampling decisions off of the parent span, but my Sampler is never called with anything but the root.
- This is quite weird because
It would be nice if something like this was easy and documented.
Thanks for the feedback. I think the sampling method in the start span only called for span without a parent. For span with a parent, it will respect the sampling result of the parent spans.
Alternatively, you can set the sampling result for individual spans using SpanBuilder::sampling_result.
I can see a benefit to changing the behavior to run sampler for every span but it comes with a performance overhead. Gonna need to investigate it a little further
I believer this may be both a documentation issue AND a bug. if you look at span creation in the Java SDK There is only reliance on the sampling decision. Whereas in the Rust SDK we're doing what almost appears as ParentBasedSampling before sampling is even considered.
Additionally when looking at the Sampler We should probably actually change the ParentBased(Box<Sampler>) to ParentBased(Box<dyn ShouldSample>).
I realize that both of these will have performance implications. Before I got on and put the work I figure I would comment first.
@hdost yeah I think we should sampling at every span creation. Do you want to take this issue as you are already working on it?
Also currently working on this in #839 , I believe it should be patched but I'm looking into how best to handle the tracing-opentelemetry crate.
@kevincox we've added some further docs and updated the implementation as of 0.18.0 Going to consider this closed. If there's something missing you can comment.
Do you have a link to these docs? I think I will also need to open another ticket for propagating sample rate information but I can do that in a more targeted ticket?
To summarize the status of the original points:
- I couldn't find more docs, but look forward to a link.
- Still an issue, don't know how to propagate down the tree. I will open a follow-up ticket/feature request.
- This seems to be fixed.
For 1. I added docs https://github.com/open-telemetry/opentelemetry-rust/pull/807
For 2. https://github.com/open-telemetry/opentelemetry-rust/pull/839 this should have handled it.
I don't think 2 is resolved. I have opened a new issue focused on it: https://github.com/open-telemetry/opentelemetry-rust/issues/886