opentelemetry-dotnet-instrumentation
opentelemetry-dotnet-instrumentation copied to clipboard
Flaky test: IntegrationTests.StrongNamedTests.SubmitsTraces
4 instances:
- https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/actions/runs/3069538719/jobs/4958307009#step:4:2590
- https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/actions/runs/3069538719/jobs/4958306796#step:4:2682
- https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/actions/runs/3069538719/jobs/4958306876#step:4:3392
- https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/actions/runs/3062693118/jobs/4943951142#step:4:3392
Confirming that is actually flakyness: alpine, macos, and windows passed on retry on PR #1234
Some span is captured but it does not have the expected attribute: https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/actions/runs/3097103461/jobs/5013412328
It seems that the issue is related to the ReJIT happening asynchronous. In a failed pass of the test we have:
[2022-09-22T02:47:31.803708579Z] [3580|3580] [debug] Enqueue for ReJIT [ModuleId=140150729282504, MethodDef=06000001, AppDomainId=94699831388896, IsDomainNeutral=0, Assembly=TestLibrary.InstrumentationTarget, Type=TestLibrary.InstrumentationTarget.Command, Method=Execute, Signature=200001]
[2022-09-22T02:47:31.803725179Z] [3580|3580] [debug] ModuleLoadFinished stored metadata for 140150729282504 TestLibrary.InstrumentationTarget AppDomain 94699831388896 [clrhost]
[2022-09-22T02:47:31.803728679Z] [3580|3580] [debug] AssemblyLoadFinished: 94699845728400 0
[2022-09-22T02:47:31.803896979Z] [3580|3580] [debug] AssemblyLoadStarted: 94699845775072
[2022-09-22T02:47:31.803910079Z] [3580|3580] [debug] ModuleLoadStarted: 140150729284448
[2022-09-22T02:47:31.803947279Z] [3580|3580] [debug] ModuleLoadFinished: 140150729284448 TestApplication.Shared AppDomain 94699831388896 [clrhost] | IsNGEN = false | IsDynamic = false | IsResource = false
[2022-09-22T02:47:31.804019179Z] [3580|3580] [debug] ModuleLoadFinished stored metadata for 140150729284448 TestApplication.Shared AppDomain 94699831388896 [clrhost]
[2022-09-22T02:47:31.804022379Z] [3580|3580] [debug] AssemblyLoadFinished: 94699845775072 0
[2022-09-22T02:47:31.804504578Z] [3580|3593] [info] Request ReJIT done for 1 methods
but there is no completion of the actul ReJIT, for example see an example of a successful run of the test:
[2022-09-22T03:03:24.426651900Z] [256760|195716] [debug] Enqueue for ReJIT [ModuleId=140704364384120, MethodDef=06000001, AppDomainId=2472867908960, IsDomainNeutral=0, Assembly=TestLibrary.InstrumentationTarget, Type=TestLibrary.InstrumentationTarget.Command, Method=Execute, Signature=200001]
[2022-09-22T03:03:24.426692800Z] [256760|195716] [debug] ModuleLoadFinished stored metadata for 140704364384120 TestLibrary.InstrumentationTarget AppDomain 2472867908960 [TestApplication.StrongNamed.exe]
[2022-09-22T03:03:24.426706600Z] [256760|195716] [debug] AssemblyLoadFinished: 2473312560592 0
[2022-09-22T03:03:24.427114000Z] [256760|259340] [info] Request ReJIT done for 1 methods
...
[2022-09-22T03:03:28.560054100Z] [256760|195716] [debug] GetReJITParameters: [moduleId: 140704364384120, methodId: 100663297]
[2022-09-22T03:03:28.560427700Z] [256760|195716] [debug] *** CallTarget_RewriterCallback() Start: TestLibrary.InstrumentationTarget.Command.Execute() [IsVoid=1, IsStatic=0, IntegrationType=OpenTelemetry.AutoInstrumentation.Instrumentations.Validations.StrongNamedValidation, Arguments=0]
[2022-09-22T03:03:28.560529800Z] [256760|195716] [debug] CallTargetTokens::EnsureBaseCalltargetTokens() Bytecode Instrumentation Assembly: OpenTelemetry.AutoInstrumentation, Version=0.3.1.0, Culture=neutral, PublicKeyToken=c0db600a13f60b51
[2022-09-22T03:03:28.560589800Z] [256760|195716] [debug] Caller Type.Id: 02000002
[2022-09-22T03:03:28.560598300Z] [256760|195716] [debug] Caller Type.IsGeneric: 0
[2022-09-22T03:03:28.560602600Z] [256760|195716] [debug] Caller Type.IsValid: 1
[2022-09-22T03:03:28.560607500Z] [256760|195716] [debug] Caller Type.Name: TestLibrary.InstrumentationTarget.Command
[2022-09-22T03:03:28.560612400Z] [256760|195716] [debug] Caller Type.TokenType: 33554432
[2022-09-22T03:03:28.560616200Z] [256760|195716] [debug] Caller Type.Spec: 0000001b
[2022-09-22T03:03:28.560620400Z] [256760|195716] [debug] Caller Type.ValueType: 0
[2022-09-22T03:03:28.560624500Z] [256760|195716] [debug] Caller Type Extend From.Id: 0c000001
[2022-09-22T03:03:28.560628900Z] [256760|195716] [debug] Caller Type Extend From.IsGeneric: 0
[2022-09-22T03:03:28.560633700Z] [256760|195716] [debug] Caller Type Extend From.IsValid: 1
[2022-09-22T03:03:28.560638000Z] [256760|195716] [debug] Caller Type Extend From.Name: System.Object
[2022-09-22T03:03:28.560643300Z] [256760|195716] [debug] Caller Type Extend From.TokenType: 16777216
[2022-09-22T03:03:28.560647600Z] [256760|195716] [debug] Caller Type Extend From.Spec: 0000001b
[2022-09-22T03:03:28.560652600Z] [256760|195716] [debug] Caller Type Extend From.ValueType: 0
[2022-09-22T03:03:28.560707000Z] [256760|195716] [info] *** CallTarget_RewriterCallback() Finished: TestLibrary.InstrumentationTarget.Command.Execute() [IsVoid=1, IsStatic=0, IntegrationType=OpenTelemetry.AutoInstrumentation.Instrumentations.Validations.StrongNamedValidation, Arguments=0]
[2022-09-22T03:03:28.560836300Z] [256760|195716] [debug] ReJITCompilationStarted: [functionId: 140704364385784, rejitId: 2, safeToBlock: 1]
[2022-09-22T03:03:28.588680300Z] [256760|195716] [debug] ReJITCompilationFinished: [functionId: 140704364385784, rejitId: 2, hrStatus: 0, safeToBlock: 1]
This seems to indicate that we need to ensure that the ReJIT completes before code to be instrumented is executed.
This seems to indicate that we need to ensure that the ReJIT completes before code to be instrumented is executed.
In other words, the startup time of a lightweight application might be too fast in time to time?
I simple modification to request the ReJIT in a synchronous fashion (directly on the thread handling ModuleLoadedFinsih) fixes the issue. However, I guess that this will have some perf impact, which, I didn't measure yet.
If the impact will be noticeable we could introduce an env var to control it. However, I would rather not do it as it is just about longer initialization.
Probably this one failed for the same reason https://github.com/open-telemetry/opentelemetry-dotnet-instrumentation/actions/runs/3128880953/jobs/5077270798
Yeah, I don't want to give the option to "perhaps a little fast" but some non-deterministic observability results. It seems that we have to offer a single and consistent way of instrumenting.