dd-trace-dotnet icon indicating copy to clipboard operation
dd-trace-dotnet copied to clipboard

Fix `StackOverflowException` issues when an application uses custom `AssemblyLoadContext`'s and has a manual/automatic tracer version mismatch

Open zacharycmontoya opened this issue 3 years ago • 2 comments

Summary of changes

Add a new integration to insert ourselves into the AssemblyLoadContext loading process to prevent StackOverflowException's from occurring when there is a manual/automatic tracer version mismatch and the application contains custom AssemblyLoadContexts.

Reason for change

In a manual/automatic tracer version mismatch scenario, the "manual" instrumentation assembly is rewritten so that DistributedTracer.GetDistributedTracer() calls the DistributedTracer.GetDistributedTracer() method from the instrumentation assembly. This is done by injecting a method invocation using a new set of AssemblyRef/TypeRef/MemberRef tokens to the new assembly and letting the runtime resolve the assembly. However, custom AssemblyLoadContexts and assembly-resolving event handlers may force the assembly reference to resolve to itself, causing the method to call itself and eventually raise a StackOverflowException.

Implementation details

The proposed solution is to inject ourselves into AssemblyLoadContext methods to force the Datadog.Trace assembly references to resolve to the automatic instrumentation assembly. Changes include:

  • Adding a new integrations to rewrite System.Runtime.Loader.AssemblyLoadContext.ResolveUsingLoad, which is responsible for invoking AssemblyLoadContext.Load. This will interrupt all ALC's so we can return the automatic instrumentation's copy of Datadog.Trace if the requested assembly is Datadog.Trace and it has a version less than or equal to the instrumentation's version.
  • Adding a new integration category Internal
  • Adding an event handler for System.Runtime.Loader.AssemblyLoadContext.Default.Resolving

Note: Assemblies loaded via AssemblyRefs and TypeRefs and loaded from the Default AssemblyLoadContext do not invoke the methods we instrument, so the application may still load a lower version of Datadog.Trace if directly referenced from the application code.

Test coverage

Tested the following scenarios locally:

  • Application running on (local) Azure Functions Runtime v3 with tracer version mismatch
  • xUnit test project with tracer version mismatch
  • Console application with tracer version mismatch

Other details

Remaining items:

  • [ ] Ensure the Internal integration cannot be selectively disabled
  • [ ] For whatever reason, trying to rewrite System.Runtime.Loader.AssemblyLoadContext.ResolveUsingEvent causes issues with the Azure Functions runtime. Ideally, we'd understand WHY we can't do that, before merging.
  • [ ] For some reason we need to add a Resolving event to the Default ALC. Make sure to understand why
  • [ ] Add automated tests, not just local one-off tests

zacharycmontoya avatar May 07 '22 01:05 zacharycmontoya

Benchmarks Report :snail:

Benchmarks for #2770 compared to master:

  • All benchmarks have the same speed
  • All benchmarks have the same allocations

The following thresholds were used for comparing the benchmark speeds:

  • Mann–Whitney U test with statistical test for significance of 5%
  • Only results indicating a difference greater than 10% and 0.3 ns are considered.

Allocation changes below 0.5% are ignored.

Benchmark details

Benchmarks.Trace.AgentWriterBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master WriteAndFlushEnrichedTraces net472 716μs 233ns 904ns 0.359 0 0 3.16 KB
master WriteAndFlushEnrichedTraces netcoreapp3.1 498μs 242ns 905ns 0 0 0 2.57 KB
#2770 WriteAndFlushEnrichedTraces net472 698μs 302ns 1.13μs 0 0 0 3.16 KB
#2770 WriteAndFlushEnrichedTraces netcoreapp3.1 499μs 347ns 1.25μs 0 0 0 2.57 KB
Benchmarks.Trace.AppSecBodyBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master AllCycleSimpleBody net472 194ns 0.128ns 0.48ns 0.0662 0 0 417 B
master AllCycleSimpleBody netcoreapp3.1 243ns 0.0927ns 0.334ns 0.00573 0 0 416 B
master AllCycleMoreComplexBody net472 192ns 0.0785ns 0.294ns 0.0624 0 0 393 B
master AllCycleMoreComplexBody netcoreapp3.1 242ns 0.141ns 0.547ns 0.00536 0 0 392 B
master BodyExtractorSimpleBody net472 270ns 0.216ns 0.837ns 0.0572 0 0 361 B
master BodyExtractorSimpleBody netcoreapp3.1 225ns 0.577ns 2.16ns 0.00365 0 0 272 B
master BodyExtractorMoreComplexBody net472 15.6μs 15.5ns 60ns 1.21 0.0156 0 7.62 KB
master BodyExtractorMoreComplexBody netcoreapp3.1 12.3μs 20.8ns 80.5ns 0.0924 0 0 6.75 KB
#2770 AllCycleSimpleBody net472 197ns 0.262ns 1.02ns 0.0661 0 0 417 B
#2770 AllCycleSimpleBody netcoreapp3.1 248ns 0.139ns 0.54ns 0.00569 0 0 416 B
#2770 AllCycleMoreComplexBody net472 193ns 0.0812ns 0.293ns 0.0623 0 0 393 B
#2770 AllCycleMoreComplexBody netcoreapp3.1 246ns 0.215ns 0.831ns 0.00535 0 0 392 B
#2770 BodyExtractorSimpleBody net472 274ns 0.308ns 1.19ns 0.0571 0 0 361 B
#2770 BodyExtractorSimpleBody netcoreapp3.1 232ns 0.266ns 0.994ns 0.00373 0 0 272 B
#2770 BodyExtractorMoreComplexBody net472 16.3μs 10.1ns 37.8ns 1.19 0.0162 0 7.62 KB
#2770 BodyExtractorMoreComplexBody netcoreapp3.1 12.9μs 9.13ns 35.4ns 0.0904 0 0 6.75 KB
Benchmarks.Trace.AspNetCoreBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendRequest net472 0ns 0ns 0ns 0 0 0 0 b
master SendRequest netcoreapp3.1 172μs 54.7ns 197ns 0.259 0 0 19.74 KB
#2770 SendRequest net472 0ns 0ns 0ns 0 0 0 0 b
#2770 SendRequest netcoreapp3.1 172μs 141ns 529ns 0.135 0 0 19.74 KB
Benchmarks.Trace.DbCommandBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteNonQuery net472 1.42μs 0.472ns 1.83ns 0.0941 0.000713 0 594 B
master ExecuteNonQuery netcoreapp3.1 1.14μs 4.98ns 19.3ns 0.00852 0 0 632 B
#2770 ExecuteNonQuery net472 1.39μs 0.331ns 1.28ns 0.0938 0.0007 0 594 B
#2770 ExecuteNonQuery netcoreapp3.1 1.16μs 0.437ns 1.69ns 0.00868 0 0 632 B
Benchmarks.Trace.ElasticsearchBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master CallElasticsearch net472 2.01μs 1.16ns 4.49ns 0.127 0 0 802 B
master CallElasticsearch netcoreapp3.1 1.35μs 0.426ns 1.65ns 0.0108 0 0 792 B
master CallElasticsearchAsync net472 2.27μs 0.667ns 2.5ns 0.149 0 0 939 B
master CallElasticsearchAsync netcoreapp3.1 1.31μs 0.596ns 2.06ns 0.0124 0 0 912 B
#2770 CallElasticsearch net472 2.11μs 1.2ns 4.66ns 0.125 0 0 802 B
#2770 CallElasticsearch netcoreapp3.1 1.24μs 1.07ns 4.13ns 0.0105 0 0 792 B
#2770 CallElasticsearchAsync net472 2.15μs 0.675ns 2.52ns 0.147 0 0 939 B
#2770 CallElasticsearchAsync netcoreapp3.1 1.33μs 0.274ns 1.02ns 0.0119 0 0 912 B
Benchmarks.Trace.GraphQLBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master ExecuteAsync net472 2.32μs 1.21ns 4.68ns 0.167 0 0 1.06 KB
master ExecuteAsync netcoreapp3.1 1.47μs 1.31ns 4.9ns 0.0139 0 0 1.03 KB
#2770 ExecuteAsync net472 2.2μs 1.1ns 4.27ns 0.166 0 0 1.06 KB
#2770 ExecuteAsync netcoreapp3.1 1.48μs 0.946ns 3.54ns 0.014 0 0 1.03 KB
Benchmarks.Trace.HttpClientBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendAsync net472 4.88μs 1.81ns 6.78ns 0.359 0 0 2.28 KB
master SendAsync netcoreapp3.1 3.16μs 1.87ns 6.99ns 0.0302 0 0 2.21 KB
#2770 SendAsync net472 4.64μs 1.98ns 7.68ns 0.357 0 0 2.28 KB
#2770 SendAsync netcoreapp3.1 3.27μs 1.76ns 6.59ns 0.0294 0 0 2.21 KB
Benchmarks.Trace.ILoggerBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net472 2.73μs 1.08ns 3.76ns 0.23 0 0 1.45 KB
master EnrichedLog netcoreapp3.1 2.2μs 1.16ns 4.34ns 0.0208 0 0 1.53 KB
#2770 EnrichedLog net472 2.67μs 1.91ns 7.39ns 0.228 0 0 1.45 KB
#2770 EnrichedLog netcoreapp3.1 2.18μs 1.49ns 5.78ns 0.0207 0 0 1.53 KB
Benchmarks.Trace.Log4netBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net472 143μs 127ns 493ns 0.641 0.214 0 4.29 KB
master EnrichedLog netcoreapp3.1 109μs 118ns 459ns 0.0551 0 0 4.18 KB
#2770 EnrichedLog net472 142μs 184ns 713ns 0.505 0.144 0 4.29 KB
#2770 EnrichedLog netcoreapp3.1 109μs 31.5ns 114ns 0.0542 0 0 4.18 KB
Benchmarks.Trace.NLogBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net472 5.36μs 1.35ns 5.22ns 0.51 0.00266 0 3.23 KB
master EnrichedLog netcoreapp3.1 4.05μs 1.39ns 5.19ns 0.0485 0 0 3.6 KB
#2770 EnrichedLog net472 5.44μs 2.25ns 8.73ns 0.508 0.00269 0 3.23 KB
#2770 EnrichedLog netcoreapp3.1 4.02μs 1.9ns 7.34ns 0.05 0 0 3.6 KB
Benchmarks.Trace.RedisBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master SendReceive net472 1.75μs 0.777ns 2.8ns 0.16 0 0 1.01 KB
master SendReceive netcoreapp3.1 1.46μs 0.733ns 2.74ns 0.0138 0 0 1.01 KB
#2770 SendReceive net472 1.74μs 0.843ns 3.15ns 0.159 0 0 1.01 KB
#2770 SendReceive netcoreapp3.1 1.6μs 0.392ns 1.47ns 0.0136 0 0 1.01 KB
Benchmarks.Trace.SerilogBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master EnrichedLog net472 4.51μs 2.32ns 9ns 0.294 0 0 1.87 KB
master EnrichedLog netcoreapp3.1 3.92μs 2.47ns 9.25ns 0.0195 0 0 1.49 KB
#2770 EnrichedLog net472 4.61μs 2.09ns 7.81ns 0.291 0 0 1.87 KB
#2770 EnrichedLog netcoreapp3.1 4.04μs 5.96ns 23.1ns 0.0201 0 0 1.49 KB
Benchmarks.Trace.SpanBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master StartFinishSpan net472 796ns 0.453ns 1.7ns 0.0723 0 0 457 B
master StartFinishSpan netcoreapp3.1 623ns 0.179ns 0.692ns 0.00626 0 0 456 B
master StartFinishScope net472 1.04μs 0.458ns 1.78ns 0.0849 0 0 538 B
master StartFinishScope netcoreapp3.1 773ns 0.254ns 0.916ns 0.0077 0 0 576 B
#2770 StartFinishSpan net472 783ns 0.105ns 0.38ns 0.0717 0 0 457 B
#2770 StartFinishSpan netcoreapp3.1 626ns 0.215ns 0.776ns 0.00624 0 0 456 B
#2770 StartFinishScope net472 1.05μs 1.76ns 6.83ns 0.0845 0 0 538 B
#2770 StartFinishScope netcoreapp3.1 781ns 0.211ns 0.76ns 0.00773 0 0 576 B
Benchmarks.Trace.TraceAnnotationsBenchmark - Same speed :heavy_check_mark: Same allocations :heavy_check_mark:

Raw results

Branch Method Toolchain Mean StdError StdDev Gen 0 Gen 1 Gen 2 Allocated
master RunOnMethodBegin net472 1.12μs 0.894ns 3.46ns 0.0849 0 0 538 B
master RunOnMethodBegin netcoreapp3.1 925ns 0.248ns 0.962ns 0.00788 0 0 576 B
#2770 RunOnMethodBegin net472 1.12μs 0.923ns 3.57ns 0.0842 0 0 538 B
#2770 RunOnMethodBegin netcoreapp3.1 898ns 3.93ns 15.2ns 0.00773 0 0 576 B

andrewlock avatar May 07 '22 01:05 andrewlock

Code Coverage Report :bar_chart:

:heavy_check_mark: Merging #2770 into master will not change line coverage :heavy_check_mark: Merging #2770 into master will not change branch coverage :no_entry: Merging #2770 into master will will increase complexity by 43

master #2770 Change
Lines 14230 / 19505 14243 / 19529
Lines % 73% 73% 0% :heavy_check_mark:
Branches 8275 / 11775 8279 / 11809
Branches % 70% 70% 0% :heavy_check_mark:
Complexity 13053 13096 43 :no_entry:

View the full report for further details:

Datadog.Trace Breakdown :heavy_check_mark:

master #2770 Change
Lines % 73% 73% 0% :heavy_check_mark:
Branches % 70% 70% 0% :heavy_check_mark:
Complexity 13053 13096 43 :no_entry:

The following classes have significant coverage changes.

File Line coverage change Branch coverage change Complexity change
Datadog.Trace.Telemetry.JsonWebRequestTelemetryTransport -27% :no_entry: -57% :no_entry: 0 :heavy_check_mark:
Datadog.Trace.ClrProfiler.CallTarget.CallTargetInvoker 0% :heavy_check_mark: 7% :heavy_check_mark: 36 :no_entry:
Datadog.Trace.Ci.CIVisibility 7% :heavy_check_mark: 7% :heavy_check_mark: 0 :heavy_check_mark:
Datadog.Trace.Telemetry.JsonHttpClientTelemetryTransport 26% :heavy_check_mark: 38% :heavy_check_mark: 0 :heavy_check_mark:

The following classes were added in #2770:

File Line coverage Branch coverage Complexity
Datadog.Trace.ClrProfiler.AutoInstrumentation.Internal.AssemblyLoadContext_Integration 100% 100% 1

View the full reports for further details:

andrewlock avatar May 07 '22 02:05 andrewlock

Closing, this is no longer needed as we found a different fix at the time

zacharycmontoya avatar Mar 29 '23 16:03 zacharycmontoya