dd-trace-py
feat(tracing): lazy sampling
This PR makes sampling lazy, which makes our extended sampling API (which can match on resource and operation name) far more useful, since sampling will now always run after those tags have been set. Currently the resource is often added after sampling and the operation name can change, so this PR effectively completes that extension.
To simplify review, the changes in this PR can be categorized as follows:
- Do not sample in `_start_span`.
- Sample right before propagation, by modifying all of the `HTTPPropagator.inject()` calls to take a tracer and span so we can sample there, right before the sampling decision is propagated.
- Sample right before encoding, using `TraceSamplingProcessor`.
- Sample right before `_DBM_Propagator.inject()`.
- Sample right before `os.fork`.
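To illustrate the overall idea (this is a toy sketch, not dd-trace-py's implementation; `LazySpan`, `sample_if_needed`, and the stub sampling rule are all hypothetical stand-ins), deferring the decision to egress points means the sampler finally sees tags like `resource` that were set after span start:

```python
class LazySpan:
    """Toy span: the sampling decision is deferred until an egress point."""
    def __init__(self, name):
        self.name = name
        self.resource = None           # often set after span start
        self.sampling_priority = None  # None means "not yet sampled"

def sample_if_needed(span):
    """Run the (stub) sampler only if no decision has been made yet."""
    if span.sampling_priority is None:
        # Stub rule for the sketch: keep spans whose resource is set.
        span.sampling_priority = 1 if span.resource else 0
    return span.sampling_priority

def inject_headers(span, headers):
    """Egress point: sample lazily, then propagate the decision."""
    sample_if_needed(span)
    headers["x-datadog-sampling-priority"] = str(span.sampling_priority)

span = LazySpan("web.request")
span.resource = "GET /users"   # set AFTER start, but BEFORE sampling runs
headers = {}
inject_headers(span, headers)
```

Because sampling runs at injection time rather than at span start, a rule matching on `resource` can see `"GET /users"` here, which eager sampling would have missed.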
Design Decisions:
- Why are we running `context.update_tags` when we sample? We need the tags and metrics from the context on the root span when we sample. Unfortunately this means we run `update_tags` twice: once when we sample, and once in the `TraceTagsProcessor` right before we encode the trace chunk.
The alternative was pulling out the `context._meta` and `context._metrics` values and combining them with the span's tags and metrics in `sampling_rule.py::tags_match`, or running sampling on those values in addition to the span's values to come up with a match. Both of these felt messier, and the same tag key could exist on both the context and the span, in which case the span's value should take precedence (as is the current behavior with `context._update_tags`). That precedence could be implemented with a `continue` in the loop evaluating tag rules against spans, but it adds to the mess. I'm still somewhat on the fence about this choice and happy to discuss.
I've put the update inside `sample` because there are multiple places where we need the tags updated before sampling (e.g. before HTTP propagation and before DBM propagation), and I didn't want to have to ensure `context.update_tags` is run correctly at every egress point before we sample.
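The precedence described above can be shown with a plain-dict sketch (hypothetical names, not the library's code): context tags are copied onto the span, but keys already present on the span win.

```python
def update_span_tags_from_context(context_meta, span_meta):
    """Copy context tags onto the span's tags, letting existing span keys
    take precedence (mirroring the behavior described for _update_tags)."""
    for key, value in context_meta.items():
        if key not in span_meta:
            span_meta[key] = value
    return span_meta

context_meta = {"env": "prod", "usr.id": "ctx-user"}
span_meta = {"usr.id": "span-user"}  # same key on both: the span's value wins
update_span_tags_from_context(context_meta, span_meta)
```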
- It's kind of messy to modify the `HTTPPropagator.inject` signature to take a span.
No doubt, but I couldn't find a way around this besides either wrapping `inject` with a method that samples before running it, or having to sample at every place we call `HTTPPropagator.inject`. It seemed safer to me to call sampling inside of it. However, it isn't great, because if customers use `inject()` in their code and don't pass in a span, sampling is not run. I will add a release note about that.
Another option was to always run `ddtrace.tracer.current_root_span()` to grab the span, which we do as a backup if `inject` is called without a span kwarg. However, Emmett and I decided that passing in an explicit span where possible should be the norm in our usage.
I'm open to any suggestions regarding this.
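A toy version of that fallback logic (all names here are hypothetical stand-ins; the real signature lives in dd-trace-py): use the explicitly passed span if given, otherwise fall back to the tracer's current root span, and sample before writing headers.

```python
class StubSpan:
    def __init__(self):
        self.sampling_priority = None

class StubTracer:
    def __init__(self, root=None):
        self._root = root
    def current_root_span(self):
        return self._root

def sample(span):
    # Stub sampler: deterministic "keep" decision for this sketch.
    if span is not None and span.sampling_priority is None:
        span.sampling_priority = 1

def inject(span_context, headers, span=None, tracer=None):
    """Sample before writing headers, using the explicit span if given,
    else falling back to the tracer's current root span."""
    if span is None and tracer is not None:
        span = tracer.current_root_span()
    sample(span)  # if no span is reachable, sampling is skipped (the caveat above)
    if span is not None and span.sampling_priority is not None:
        headers["x-datadog-sampling-priority"] = str(span.sampling_priority)

tracer = StubTracer(root=StubSpan())
headers = {}
inject({}, headers, tracer=tracer)  # no explicit span: root-span fallback kicks in
```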
- OK, so you instrumented `os.fork` to sample before we fork, but what about all of the other ways you can change execution context, e.g. `spawn`, `popen`, etc.? Can't that lead to differences in `context.sampling_priority` between the parent process's spans and the child process's spans?
Yes, it can. I'm not sure I can add instrumentation for every way to change execution context in this PR, but I'd be happy to hear advice on adding it to any easy-to-reach places. I've also added to the documentation that you should sample manually before changing execution context. I'd imagine there are one or two spots that might be well worth it; in particular, I'm worried about breaking traces for gunicorn, gevent, celery, greenlet, etc.
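Why sampling before the fork matters can be sketched with plain `os.fork` and a stub sampler (hypothetical names; POSIX-only, since `os.fork` does not exist on Windows): deciding before forking guarantees the parent and child share one sampling decision instead of each deciding independently.

```python
import os

class StubSpan:
    def __init__(self):
        self.sampling_priority = None

def sample(span):
    # Stub: a deterministic "keep" decision for the sketch.
    if span.sampling_priority is None:
        span.sampling_priority = 1

span = StubSpan()
sample(span)          # decide BEFORE the fork, so the child inherits it
pid = os.fork()
if pid == 0:
    # Child: report the inherited priority via its exit code.
    os._exit(span.sampling_priority)
_, status = os.waitpid(pid, 0)
child_priority = os.WEXITSTATUS(status)
# Parent and child now agree on the sampling decision.
```

Had sampling been deferred past the fork, each process would run the sampler on its own copy of the span and could reach different priorities, breaking the trace.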
- What about when customers have an egress point that isn't automatically instrumented, or for which they don't use `HTTPPropagator.inject`? Won't that break sampling?
Yes, they'll need to sample manually before the injection. I've added that to the docs here.
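What "sample manually before the injection" might look like for a custom, uninstrumented egress point, as a toy sketch (stub names throughout; the real API calls are covered in the dd-trace-py docs mentioned above):

```python
class StubSpan:
    def __init__(self):
        self.sampling_priority = None

def sample(span):
    # Stub standing in for the tracer's sampling call.
    if span.sampling_priority is None:
        span.sampling_priority = 1

def custom_inject(span, message):
    """A custom egress point, e.g. attaching trace context to a queue payload."""
    message["sampling_priority"] = span.sampling_priority

span = StubSpan()
sample(span)   # manual sampling BEFORE the custom injection
message = {}
custom_inject(span, message)
```

Without the explicit `sample(span)` call, the payload would carry `None` and the downstream service could not honor the (never-made) sampling decision.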
Checklist
- [ ] Change(s) are motivated and described in the PR description
- [ ] Testing strategy is described if automated tests are not included in the PR
- [ ] Risks are described (performance impact, potential for breakage, maintainability)
- [ ] Change is maintainable (easy to change, telemetry, documentation)
- [ ] Library release note guidelines are followed or label `changelog/no-changelog` is set
- [ ] Documentation is included (in-code, generated user docs, public corp docs)
- [ ] Backport labels are set (if applicable)
- [ ] If this PR changes the public interface, I've notified @DataDog/apm-tees.
- [ ] If change touches code that signs or publishes builds or packages, or handles credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
Reviewer Checklist
- [ ] Title is accurate
- [ ] All changes are related to the pull request's stated goal
- [ ] Description motivates each change
- [ ] Avoids breaking API changes
- [ ] Testing strategy adequately addresses listed risks
- [ ] Change is maintainable (easy to change, telemetry, documentation)
- [ ] Release note makes sense to a user of the library
- [ ] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
- [ ] Backport labels are set in a manner that is consistent with the release branch maintenance policy
Benchmarks
Benchmark execution time: 2024-03-25 19:59:08
Comparing candidate commit 16ee92ab20b896a0e30a92ab51d7ae6c17458f30 in PR branch zachg/extended_sampling_api
with baseline commit 979970609faa2a203823e4246cea446674f197a5 in branch main.
Found 29 performance improvements and 22 performance regressions! Performance is the same for 150 metrics, 9 unstable metrics.
scenario:coreapiscenario-context_with_data_listeners
- 🟩 max_rss_usage [-715.592KB; -466.104KB] or [-3.267%; -2.128%]

scenario:coreapiscenario-context_with_data_only_all_listeners
- 🟩 max_rss_usage [-723.165KB; -444.605KB] or [-3.305%; -2.032%]

scenario:coreapiscenario-core_dispatch_listeners
- 🟩 max_rss_usage [-718.884KB; -440.284KB] or [-3.288%; -2.014%]

scenario:coreapiscenario-core_dispatch_no_listeners
- 🟩 max_rss_usage [-765.277KB; -552.406KB] or [-3.497%; -2.525%]

scenario:coreapiscenario-core_dispatch_with_results_no_listeners
- 🟩 max_rss_usage [-735.158KB; -634.544KB] or [-3.358%; -2.899%]

scenario:httppropagationextract-invalid_span_id_header
- 🟩 max_rss_usage [-691.315KB; -447.373KB] or [-3.176%; -2.055%]

scenario:httppropagationextract-large_valid_headers_all
- 🟩 max_rss_usage [-691.634KB; -600.245KB] or [-3.163%; -2.745%]

scenario:httppropagationextract-medium_valid_headers_all
- 🟩 max_rss_usage [-747.678KB; -594.990KB] or [-3.419%; -2.721%]

scenario:httppropagationextract-wsgi_invalid_priority_header
- 🟩 max_rss_usage [-728.896KB; -517.107KB] or [-3.353%; -2.379%]

scenario:httppropagationextract-wsgi_large_header_no_matches
- 🟩 max_rss_usage [-1048.036KB; -854.147KB] or [-4.790%; -3.904%]

scenario:httppropagationextract-wsgi_medium_header_no_matches
- 🟩 max_rss_usage [-948.758KB; -745.347KB] or [-4.341%; -3.410%]

scenario:httppropagationinject-ids_only
- 🟥 execution_time [+1.421µs; +1.479µs] or [+17.340%; +18.039%]

scenario:httppropagationinject-with_all
- 🟥 execution_time [+1.620µs; +1.729µs] or [+8.562%; +9.141%]
- 🟩 max_rss_usage [-766.099KB; -716.653KB] or [-3.514%; -3.287%]

scenario:httppropagationinject-with_dd_origin
- 🟥 execution_time [+1.365µs; +1.419µs] or [+10.881%; +11.310%]

scenario:httppropagationinject-with_priority_and_origin
- 🟥 execution_time [+1.429µs; +1.519µs] or [+10.043%; +10.673%]
- 🟩 max_rss_usage [-796.655KB; -734.429KB] or [-3.653%; -3.368%]

scenario:httppropagationinject-with_sampling_priority
- 🟥 execution_time [+1.443µs; +1.511µs] or [+14.304%; +14.977%]
- 🟥 max_rss_usage [+730.758KB; +782.714KB] or [+3.472%; +3.719%]

scenario:httppropagationinject-with_tags
- 🟥 execution_time [+1.546µs; +1.648µs] or [+11.303%; +12.049%]

scenario:httppropagationinject-with_tags_invalid
- 🟥 execution_time [+1.539µs; +1.626µs] or [+9.716%; +10.263%]

scenario:httppropagationinject-with_tags_max_size
- 🟥 execution_time [+1.532µs; +1.615µs] or [+10.799%; +11.382%]

scenario:otelspan-add-metrics
- 🟩 execution_time [-21.243ms; -17.912ms] or [-6.838%; -5.766%]
- 🟩 max_rss_usage [-1.046MB; -0.967MB] or [-2.829%; -2.616%]

scenario:otelspan-add-tags
- 🟩 execution_time [-18.602ms; -15.619ms] or [-6.361%; -5.341%]
- 🟩 max_rss_usage [-1.065MB; -0.986MB] or [-2.876%; -2.664%]

scenario:otelspan-start
- 🟩 execution_time [-18.838ms; -17.135ms] or [-36.598%; -33.289%]
- 🟩 max_rss_usage [-2.569MB; -2.411MB] or [-7.182%; -6.740%]

scenario:otelspan-start-finish
- 🟥 max_rss_usage [+610.028KB; +715.437KB] or [+2.713%; +3.182%]

scenario:otelspan-start-finish-telemetry
- 🟥 max_rss_usage [+600.494KB; +688.517KB] or [+2.669%; +3.061%]

scenario:samplingrules-low_match
- 🟥 execution_time [+15.602µs; +22.237µs] or [+3.420%; +4.875%]

scenario:sethttpmeta-collectipvariant_exists
- 🟩 max_rss_usage [-774.131KB; -549.697KB] or [-3.501%; -2.486%]

scenario:sethttpmeta-no-collectipvariant
- 🟥 max_rss_usage [+682.816KB; +757.747KB] or [+3.187%; +3.537%]

scenario:sethttpmeta-no-useragentvariant
- 🟩 max_rss_usage [-750.457KB; -539.373KB] or [-3.387%; -2.435%]

scenario:sethttpmeta-obfuscation-no-query
- 🟥 max_rss_usage [+658.548KB; +738.188KB] or [+3.073%; +3.444%]

scenario:sethttpmeta-obfuscation-regular-case-explicit-query
- 🟩 max_rss_usage [-1.176MB; -1.110MB] or [-5.259%; -4.963%]

scenario:sethttpmeta-useragentvariant_not_exists_2
- 🟥 max_rss_usage [+731.051KB; +813.960KB] or [+3.414%; +3.802%]

scenario:span-add-metrics
- 🟩 execution_time [-18.312ms; -16.996ms] or [-17.851%; -16.567%]
- 🟩 max_rss_usage [-2.452MB; -2.377MB] or [-4.320%; -4.189%]

scenario:span-add-tags
- 🟩 execution_time [-16.959ms; -14.234ms] or [-10.340%; -8.679%]
- 🟩 max_rss_usage [-791.740KB; -711.492KB] or [-2.557%; -2.298%]

scenario:span-start
- 🟩 execution_time [-15.279ms; -13.796ms] or [-46.532%; -42.018%]

scenario:span-start-finish
- 🟥 execution_time [+1.332ms; +2.302ms] or [+2.608%; +4.506%]

scenario:span-start-finish-telemetry
- 🟥 execution_time [+1.082ms; +1.996ms] or [+2.064%; +3.808%]
- 🟥 max_rss_usage [+570.462KB; +724.693KB] or [+2.686%; +3.412%]

scenario:span-start-finish-traceid128
- 🟥 execution_time [+1.320ms; +2.478ms] or [+2.425%; +4.552%]
- 🟥 max_rss_usage [+616.307KB; +766.912KB] or [+2.905%; +3.615%]

scenario:span-start-traceid128
- 🟩 execution_time [-15.669ms; -14.206ms] or [-46.877%; -42.501%]

scenario:tracer-large
- 🟥 max_rss_usage [+668.594KB; +744.936KB] or [+3.021%; +3.366%]

scenario:tracer-medium
- 🟩 max_rss_usage [-809.116KB; -726.884KB] or [-3.703%; -3.327%]

scenario:tracer-small
- 🟥 max_rss_usage [+709.645KB; +784.576KB] or [+3.362%; +3.717%]
Datadog Report
Branch report: zachg/extended_sampling_api
Commit report: 6797470
Test service: dd-trace-py
:white_check_mark: 0 Failed, 2766 Passed, 235 Skipped, 1h 23m 31.16s Total duration (4m 35.56s time saved)
Open questions:
- Does this break gunicorn or gevent, since for execution-context switching I've only changed `os.fork` to sample before the context change? No, it does not: we don't ever actually start traces in the parent process; the request is just routed to the worker, and the trace starts there. If middleware is added to a gunicorn + gevent application, the middleware sits between the parent process routing the request and the worker, so sampling propagation should work fine there as well, as long as the middleware and the application logic execute in the same execution context.
- Do I need to update the advanced asyncio section, like I did for the other execution-context-changing examples? https://ddtrace.readthedocs.io/en/stable/advanced_usage.html#tracing-across-asyncio-tasks
- Have I forgotten any other egress points where I should be running the sampler first?

I think I need to change the system-tests `start_span` method a bit to get the sampling and propagation tests to pass.
This parametric system-tests run shows we pass all of the sampling test cases this PR should pass (the PR is just this one, but running against a system-tests branch that enables all of the sampling tests and has a modified inject method, which will be merged once this is released): https://github.com/DataDog/dd-trace-py/actions/runs/8412705554/job/23033894434?pr=8715