dd-trace-py
feat(tracing): lazy sampling
This PR makes sampling lazy, which makes our extended sampling API (which can match on resource and operation name) far more useful, since sampling will now always run after those tags have been set. Currently the resource is often added after sampling and the operation name can change, so this PR effectively completes that extension.
To simplify review, the changes in this PR can be categorized as follows:
- Do not sample in `_start_span`.
- Sample right before propagation, by modifying all of the `HTTPPropagator.inject()` calls to take a tracer and span so we can sample there, right before the sampling decision is propagated.
- Sample right before encoding, using `TraceSamplingProcessor`.
- Sample right before `_DBM_Propagator.inject()`.
- Sample right before `os.fork`.
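To illustrate the overall idea (this is a toy sketch, not dd-trace-py's implementation; `LazySpan`, `sample_if_needed`, and the stub sampling rule are all hypothetical stand-ins), deferring the decision to egress points means the sampler finally sees tags like `resource` that were set after span start:

```python
class LazySpan:
    """Toy span: the sampling decision is deferred until an egress point."""
    def __init__(self, name):
        self.name = name
        self.resource = None           # often set after span start
        self.sampling_priority = None  # None means "not yet sampled"

def sample_if_needed(span):
    """Run the (stub) sampler only if no decision has been made yet."""
    if span.sampling_priority is None:
        # Stub rule for the sketch: keep spans whose resource is set.
        span.sampling_priority = 1 if span.resource else 0
    return span.sampling_priority

def inject_headers(span, headers):
    """Egress point: sample lazily, then propagate the decision."""
    sample_if_needed(span)
    headers["x-datadog-sampling-priority"] = str(span.sampling_priority)

span = LazySpan("web.request")
span.resource = "GET /users"   # set AFTER start, but BEFORE sampling runs
headers = {}
inject_headers(span, headers)
```

Because sampling runs at injection time rather than at span start, a rule matching on `resource` can see `"GET /users"` here, which eager sampling would have missed.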
Design Decisions:
- Why are we running `context.update_tags` when we sample? We need the tags and metrics from the context on the root span when we sample. Unfortunately this means we run `update_tags` twice: once when we sample, and once in the `TraceTagsProcessor` right before we encode the trace chunk.
The alternative was pulling out the `context._meta` and `context._metrics` values and combining them with the span's tags and metrics in `sampling_rule.py::tags_match`, or running sampling on those values in addition to the span's values to come up with a match. Both of these felt messier, and the same tag key could exist on both the context and the span, in which case the span's value should take precedence (as is the current behavior with `context._update_tags`). That precedence could be implemented with a `continue` in the loop evaluating tag rules against spans, but it adds to the mess. I'm still somewhat on the fence about this choice and happy to discuss.
I've put the update inside `sample` because there are multiple places where we need the tags updated before sampling (e.g. before HTTP propagation and before DBM propagation), and I didn't want to have to ensure `context.update_tags` is run correctly at every egress point before we sample.
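The precedence described above can be shown with a plain-dict sketch (hypothetical names, not the library's code): context tags are copied onto the span, but keys already present on the span win.

```python
def update_span_tags_from_context(context_meta, span_meta):
    """Copy context tags onto the span's tags, letting existing span keys
    take precedence (mirroring the behavior described for _update_tags)."""
    for key, value in context_meta.items():
        if key not in span_meta:
            span_meta[key] = value
    return span_meta

context_meta = {"env": "prod", "usr.id": "ctx-user"}
span_meta = {"usr.id": "span-user"}  # same key on both: the span's value wins
update_span_tags_from_context(context_meta, span_meta)
```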
- It's kind of messy to modify the `HTTPPropagator.inject` signature to take a span.
No doubt, but I couldn't find a way around this besides either wrapping `inject` with a method that samples before running it, or having to sample at every place we call `HTTPPropagator.inject`. It seemed safer to me to call sampling inside of it. However, it isn't great, because if customers use `inject()` in their code and don't pass in a span, sampling is not run. I will add a release note about that.
Another option was to always run `ddtrace.tracer.current_root_span()` to grab the span, which we do as a backup if `inject` is called without a span kwarg. However, Emmett and I decided that passing in an explicit span where possible should be the norm in our usage.
I'm open to any suggestions regarding this.
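A toy version of that fallback logic (all names here are hypothetical stand-ins; the real signature lives in dd-trace-py): use the explicitly passed span if given, otherwise fall back to the tracer's current root span, and sample before writing headers.

```python
class StubSpan:
    def __init__(self):
        self.sampling_priority = None

class StubTracer:
    def __init__(self, root=None):
        self._root = root
    def current_root_span(self):
        return self._root

def sample(span):
    # Stub sampler: deterministic "keep" decision for this sketch.
    if span is not None and span.sampling_priority is None:
        span.sampling_priority = 1

def inject(span_context, headers, span=None, tracer=None):
    """Sample before writing headers, using the explicit span if given,
    else falling back to the tracer's current root span."""
    if span is None and tracer is not None:
        span = tracer.current_root_span()
    sample(span)  # if no span is reachable, sampling is skipped (the caveat above)
    if span is not None and span.sampling_priority is not None:
        headers["x-datadog-sampling-priority"] = str(span.sampling_priority)

tracer = StubTracer(root=StubSpan())
headers = {}
inject({}, headers, tracer=tracer)  # no explicit span: root-span fallback kicks in
```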
- OK, so you instrumented `os.fork` to sample before we fork, but what about all of the other ways you can change execution context, e.g. `spawn`, `popen`, etc.? Can't that lead to differences in `context.sampling_priority` between the parent process's spans and the child process's spans?
Yes, it can. I'm not sure I can add instrumentation for every way to change execution context in this PR, but I'd be happy to hear advice on adding it to any easy-to-reach places. I've also added to the documentation that you should sample manually before changing execution context. I'd imagine there are one or two spots that might be well worth it; in particular, I'm worried about breaking traces for gunicorn, gevent, celery, greenlet, etc.
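Why sampling before the fork matters can be sketched with plain `os.fork` and a stub sampler (hypothetical names; POSIX-only, since `os.fork` does not exist on Windows): deciding before forking guarantees the parent and child share one sampling decision instead of each deciding independently.

```python
import os

class StubSpan:
    def __init__(self):
        self.sampling_priority = None

def sample(span):
    # Stub: a deterministic "keep" decision for the sketch.
    if span.sampling_priority is None:
        span.sampling_priority = 1

span = StubSpan()
sample(span)          # decide BEFORE the fork, so the child inherits it
pid = os.fork()
if pid == 0:
    # Child: report the inherited priority via its exit code.
    os._exit(span.sampling_priority)
_, status = os.waitpid(pid, 0)
child_priority = os.WEXITSTATUS(status)
# Parent and child now agree on the sampling decision.
```

Had sampling been deferred past the fork, each process would run the sampler on its own copy of the span and could reach different priorities, breaking the trace.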
- What about when customers have an egress point that isn't automatically instrumented, or for which they don't use `HTTPPropagator.inject`? Won't that break sampling?
Yes, they'll need to sample manually before the injection. I've added that to the docs here.
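What "sample manually before the injection" might look like for a custom, uninstrumented egress point, as a toy sketch (stub names throughout; the real API calls are covered in the dd-trace-py docs mentioned above):

```python
class StubSpan:
    def __init__(self):
        self.sampling_priority = None

def sample(span):
    # Stub standing in for the tracer's sampling call.
    if span.sampling_priority is None:
        span.sampling_priority = 1

def custom_inject(span, message):
    """A custom egress point, e.g. attaching trace context to a queue payload."""
    message["sampling_priority"] = span.sampling_priority

span = StubSpan()
sample(span)   # manual sampling BEFORE the custom injection
message = {}
custom_inject(span, message)
```

Without the explicit `sample(span)` call, the payload would carry `None` and the downstream service could not honor the (never-made) sampling decision.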
Checklist
- [ ] Change(s) are motivated and described in the PR description
- [ ] Testing strategy is described if automated tests are not included in the PR
- [ ] Risks are described (performance impact, potential for breakage, maintainability)
- [ ] Change is maintainable (easy to change, telemetry, documentation)
- [ ] Library release note guidelines are followed or label `changelog/no-changelog` is set
- [ ] Documentation is included (in-code, generated user docs, public corp docs)
- [ ] Backport labels are set (if applicable)
- [ ] If this PR changes the public interface, I've notified @DataDog/apm-tees.
- [ ] If change touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
Reviewer Checklist
- [ ] Title is accurate
- [ ] All changes are related to the pull request's stated goal
- [ ] Description motivates each change
- [ ] Avoids breaking API changes
- [ ] Testing strategy adequately addresses listed risks
- [ ] Change is maintainable (easy to change, telemetry, documentation)
- [ ] Release note makes sense to a user of the library
- [ ] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
- [ ] Backport labels are set in a manner that is consistent with the release branch maintenance policy
Benchmarks
Benchmark execution time: 2024-03-25 19:59:08
Comparing candidate commit 16ee92ab20b896a0e30a92ab51d7ae6c17458f30 in PR branch zachg/extended_sampling_api
with baseline commit 979970609faa2a203823e4246cea446674f197a5 in branch main.
Found 29 performance improvements and 22 performance regressions! Performance is the same for 150 metrics, 9 unstable metrics.
scenario:coreapiscenario-context_with_data_listeners
- 🟩 max_rss_usage [-715.592KB; -466.104KB] or [-3.267%; -2.128%]

scenario:coreapiscenario-context_with_data_only_all_listeners
- 🟩 max_rss_usage [-723.165KB; -444.605KB] or [-3.305%; -2.032%]

scenario:coreapiscenario-core_dispatch_listeners
- 🟩 max_rss_usage [-718.884KB; -440.284KB] or [-3.288%; -2.014%]

scenario:coreapiscenario-core_dispatch_no_listeners
- 🟩 max_rss_usage [-765.277KB; -552.406KB] or [-3.497%; -2.525%]

scenario:coreapiscenario-core_dispatch_with_results_no_listeners
- 🟩 max_rss_usage [-735.158KB; -634.544KB] or [-3.358%; -2.899%]

scenario:httppropagationextract-invalid_span_id_header
- 🟩 max_rss_usage [-691.315KB; -447.373KB] or [-3.176%; -2.055%]

scenario:httppropagationextract-large_valid_headers_all
- 🟩 max_rss_usage [-691.634KB; -600.245KB] or [-3.163%; -2.745%]

scenario:httppropagationextract-medium_valid_headers_all
- 🟩 max_rss_usage [-747.678KB; -594.990KB] or [-3.419%; -2.721%]

scenario:httppropagationextract-wsgi_invalid_priority_header
- 🟩 max_rss_usage [-728.896KB; -517.107KB] or [-3.353%; -2.379%]

scenario:httppropagationextract-wsgi_large_header_no_matches
- 🟩 max_rss_usage [-1048.036KB; -854.147KB] or [-4.790%; -3.904%]

scenario:httppropagationextract-wsgi_medium_header_no_matches
- 🟩 max_rss_usage [-948.758KB; -745.347KB] or [-4.341%; -3.410%]

scenario:httppropagationinject-ids_only
- 🟥 execution_time [+1.421µs; +1.479µs] or [+17.340%; +18.039%]

scenario:httppropagationinject-with_all
- 🟥 execution_time [+1.620µs; +1.729µs] or [+8.562%; +9.141%]
- 🟩 max_rss_usage [-766.099KB; -716.653KB] or [-3.514%; -3.287%]

scenario:httppropagationinject-with_dd_origin
- 🟥 execution_time [+1.365µs; +1.419µs] or [+10.881%; +11.310%]

scenario:httppropagationinject-with_priority_and_origin
- 🟥 execution_time [+1.429µs; +1.519µs] or [+10.043%; +10.673%]
- 🟩 max_rss_usage [-796.655KB; -734.429KB] or [-3.653%; -3.368%]

scenario:httppropagationinject-with_sampling_priority
- 🟥 execution_time [+1.443µs; +1.511µs] or [+14.304%; +14.977%]
- 🟥 max_rss_usage [+730.758KB; +782.714KB] or [+3.472%; +3.719%]

scenario:httppropagationinject-with_tags
- 🟥 execution_time [+1.546µs; +1.648µs] or [+11.303%; +12.049%]

scenario:httppropagationinject-with_tags_invalid
- 🟥 execution_time [+1.539µs; +1.626µs] or [+9.716%; +10.263%]

scenario:httppropagationinject-with_tags_max_size
- 🟥 execution_time [+1.532µs; +1.615µs] or [+10.799%; +11.382%]

scenario:otelspan-add-metrics
- 🟩 execution_time [-21.243ms; -17.912ms] or [-6.838%; -5.766%]
- 🟩 max_rss_usage [-1.046MB; -0.967MB] or [-2.829%; -2.616%]

scenario:otelspan-add-tags
- 🟩 execution_time [-18.602ms; -15.619ms] or [-6.361%; -5.341%]
- 🟩 max_rss_usage [-1.065MB; -0.986MB] or [-2.876%; -2.664%]

scenario:otelspan-start
- 🟩 execution_time [-18.838ms; -17.135ms] or [-36.598%; -33.289%]
- 🟩 max_rss_usage [-2.569MB; -2.411MB] or [-7.182%; -6.740%]

scenario:otelspan-start-finish
- 🟥 max_rss_usage [+610.028KB; +715.437KB] or [+2.713%; +3.182%]

scenario:otelspan-start-finish-telemetry
- 🟥 max_rss_usage [+600.494KB; +688.517KB] or [+2.669%; +3.061%]

scenario:samplingrules-low_match
- 🟥 execution_time [+15.602µs; +22.237µs] or [+3.420%; +4.875%]

scenario:sethttpmeta-collectipvariant_exists
- 🟩 max_rss_usage [-774.131KB; -549.697KB] or [-3.501%; -2.486%]

scenario:sethttpmeta-no-collectipvariant
- 🟥 max_rss_usage [+682.816KB; +757.747KB] or [+3.187%; +3.537%]

scenario:sethttpmeta-no-useragentvariant
- 🟩 max_rss_usage [-750.457KB; -539.373KB] or [-3.387%; -2.435%]

scenario:sethttpmeta-obfuscation-no-query
- 🟥 max_rss_usage [+658.548KB; +738.188KB] or [+3.073%; +3.444%]

scenario:sethttpmeta-obfuscation-regular-case-explicit-query
- 🟩 max_rss_usage [-1.176MB; -1.110MB] or [-5.259%; -4.963%]

scenario:sethttpmeta-useragentvariant_not_exists_2
- 🟥 max_rss_usage [+731.051KB; +813.960KB] or [+3.414%; +3.802%]

scenario:span-add-metrics
- 🟩 execution_time [-18.312ms; -16.996ms] or [-17.851%; -16.567%]
- 🟩 max_rss_usage [-2.452MB; -2.377MB] or [-4.320%; -4.189%]

scenario:span-add-tags
- 🟩 execution_time [-16.959ms; -14.234ms] or [-10.340%; -8.679%]
- 🟩 max_rss_usage [-791.740KB; -711.492KB] or [-2.557%; -2.298%]

scenario:span-start
- 🟩 execution_time [-15.279ms; -13.796ms] or [-46.532%; -42.018%]

scenario:span-start-finish
- 🟥 execution_time [+1.332ms; +2.302ms] or [+2.608%; +4.506%]

scenario:span-start-finish-telemetry
- 🟥 execution_time [+1.082ms; +1.996ms] or [+2.064%; +3.808%]
- 🟥 max_rss_usage [+570.462KB; +724.693KB] or [+2.686%; +3.412%]

scenario:span-start-finish-traceid128
- 🟥 execution_time [+1.320ms; +2.478ms] or [+2.425%; +4.552%]
- 🟥 max_rss_usage [+616.307KB; +766.912KB] or [+2.905%; +3.615%]

scenario:span-start-traceid128
- 🟩 execution_time [-15.669ms; -14.206ms] or [-46.877%; -42.501%]

scenario:tracer-large
- 🟥 max_rss_usage [+668.594KB; +744.936KB] or [+3.021%; +3.366%]

scenario:tracer-medium
- 🟩 max_rss_usage [-809.116KB; -726.884KB] or [-3.703%; -3.327%]

scenario:tracer-small
- 🟥 max_rss_usage [+709.645KB; +784.576KB] or [+3.362%; +3.717%]
Datadog Report
Branch report: zachg/extended_sampling_api
Commit report: 6797470
Test service: dd-trace-py
:white_check_mark: 0 Failed, 2766 Passed, 235 Skipped, 1h 23m 31.16s Total duration (4m 35.56s time saved)
Open questions:
- Does this break gunicorn or gevent, since for execution-context switching I've only changed `os.fork` to sample before the context change? No, it does not: we don't ever actually start traces in the parent process; the request is just routed to the worker, and the trace starts there. If middleware is added to a gunicorn + gevent application, the middleware sits between the parent process routing the request and the worker, so sampling propagation should work fine there as well, as long as the middleware and the application logic execute in the same execution context.
- Do I need to update the advanced asyncio section, like I did for the other execution-context-changing examples? https://ddtrace.readthedocs.io/en/stable/advanced_usage.html#tracing-across-asyncio-tasks
- Have I forgotten any other egress points where I should be running the sampler first?

I think I need to change the system-tests `start_span` method a bit to get the sampling and propagation tests to pass.
This parametric system-tests run shows we pass all of the sampling test cases this PR should pass (the PR is just this one, but running against a system-tests branch that enables all of the sampling tests and has a modified inject method, which will be merged once this is released): https://github.com/DataDog/dd-trace-py/actions/runs/8412705554/job/23033894434?pr=8715