apm-agent-dotnet icon indicating copy to clipboard operation
apm-agent-dotnet copied to clipboard

Randomly not sending APM data to

Open dabaracus opened this issue 5 years ago • 15 comments

I have a number of .NetCore webservices that are using v1.1.0. They will send APM data to my server perfectly fine for a couple of days and then they will stop sending. The web services are still up and running, but i am not getting any data in my ELK stack. The only way to get the services to come back to sending data is to reboot the Windows box.

I have checked the windows event logs and there is nothing there, also my application logs that would indicate an issue.

dabaracus avatar Dec 11 '19 01:12 dabaracus

Thanks for reporting this.

The only theory I can came up with is that maybe the thread that sends the data to the APM Server from some reason dies. I can dig into the code to maybe verify this.

Although:

The only way to get the services to come back to sending data is to reboot the Windows box.

Does this mean that restarting the process itself does not solve the problem? If my theory is true then just restarting the process that you monitor would solve the issue - no windows reboot should be needed in that case.

gregkalapos avatar Dec 11 '19 12:12 gregkalapos

Closing for inactivity - feel free to reopen.

gregkalapos avatar Jan 28 '20 18:01 gregkalapos

Having the same issue with an ASP.NET Full Framework 4.7.2 application. Works for some days, but stops sending to APM after a while. Is also happens when a recycle of the APPPOOL occurs. In IIS, another recycle of the APPPOOL does not fixes it. Only explicitly stopping the website, and starting it again, will work. Since it's a highly demanding site, we can't restart it every time when it stops sending.. We're planning to migrate to .NET Core in the near future, but for now it's very annoying.. Any suggestions? Thanks in advance!

Cerebellum90 avatar Feb 05 '20 15:02 Cerebellum90

@Cerebellum90 do you maybe have some agent logs? I'd be especially interested in the last lines before the agent stops sending data.

gregkalapos avatar Feb 05 '20 15:02 gregkalapos

@gregkalapos currently we do not have explicitly set the ElasticApm:LogLevel in the web.config. But the documentation says it defaults to ERROR. Will it still log this way? If so, we do not see any logging on around the timestamps when it stopped. I can set it explicitly to TRACE for example. Will that be sufficient for the information you require? Also, everytime it stopped logging, we do see our own logging from the Application_Start event, so we do know the app restarted on that time (because of recycle or something else)

Cerebellum90 avatar Feb 05 '20 15:02 Cerebellum90

But the documentation says it defaults to ERROR. Will it still log this way?

In default - which you correctly say is error - the agent only logs errors. So, if you don't see anything in the logs, then probably there was no error and the agent silently dies (this is just a theory, but yeah, in default only errors are logs)

I can set it explicitly to TRACE for example. Will that be sufficient for the information you require?

It's hard to say if it'll be helpful or not - if there is no error, then probably all we'd see is that everything works fine and then after a while the agent silently dies. We can give this a try, such a log would be definitely useful, but may not be enough to find the root cause.

Also, everytime it stopped logging, we do see our own logging from the Application_Start event, so we do know the app restarted on that time (because of recycle or something else)

That's useful info - maybe this is the point when seeing some agent logs could help.

This gives me 2 theories:

  • 1 is that IIS kills all threads at some point, then it'll start them again and our agent initialization process does not cover this scenario
  • The next one is still (and this basically overlaps with the other theory) is that the thread that sends the data to the apm server is dead.

For the second, there would be a fairly easy way to verify - do you know process explorer or windbg?

Once the agent stops working, could you just dump all the threads from the process? !eestack in windbg.

In normal case you'll see lots of threads and in the latest agent version you'll see 2 with stack frames that contain Elastic.Apm., like this:

Thread  11
Current frame: ntdll!NtWaitForMultipleObjects + 0x14
Child-SP         RetAddr          Caller, Callee
000000b9ec6fdf40 00007ffda156c74e KERNELBASE!WaitForMultipleObjectsEx + 0xfe, calling ntdll!NtWaitForMultipleObjects
000000b9ec6fdfe0 00007ffda156c840 KERNELBASE!WaitForMultipleObjectsEx + 0x1f0, calling ntdll!RtlActivateActivationContextUnsafeFast
000000b9ec6fe030 00007ffd696a1b34 (MethodDesc 00007ffd69107c98 + 0x94 System.Threading.Tasks.TaskContinuation.InlineIfPossibleOrElseQueue(System.Threading.Tasks.Task, Boolean)), calling (MethodDesc 00007ffd69100fb8 + 0 System.Threading.Tasks.TaskScheduler.TryRunInline(System.Threading.Tasks.Task, Boolean))
000000b9ec6fe080 00007ffd696426b4 (MethodDesc 00007ffd690fc428 + 0x1b4 System.Threading.Tasks.Task.RunContinuations(System.Object))
000000b9ec6fe0c0 00007ffd695ed11f (MethodDesc 00007ffd690f1380 + 0x1f System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()), calling coreclr!IL_Throw [e:\a\_work\725\s\src\vm\jithelpers.cpp:4790]
000000b9ec6fe150 00007ffd696a6f2d (MethodDesc 00007ffd69108970 + 0x8d System.Threading.Tasks.Task`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib]].TrySetResult(System.Threading.Tasks.VoidTaskResult)), calling (MethodDesc 00007ffd690fc418 + 0 System.Threading.Tasks.Task.FinishContinuations())
000000b9ec6fe158 00007ffd6a23c862 coreclr!IL_Throw + 0x62 [e:\a\_work\725\s\src\vm\jithelpers.cpp:4802], calling coreclr!LazyMachStateCaptureState [E:\A\_work\725\s\src\vm\amd64\getstate.asm:62]
000000b9ec6fe190 00007ffd69687139 (MethodDesc 00007ffd691043d0 + 0x59 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib]].SetExistingTaskResult(System.Threading.Tasks.VoidTaskResult)), calling (MethodDesc 00007ffd69108970 + 0 System.Threading.Tasks.Task`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib]].TrySetResult(System.Threading.Tasks.VoidTaskResult))
000000b9ec6fe1b0 00007ffd69688fb4 (MethodDesc 00007ffd69104798 + 0x44 System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)), calling (MethodDesc 00007ffd691047a8 + 0 System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(System.Threading.Tasks.Task))
000000b9ec6fe1d0 00007ffd69760e13 (MethodDesc 00007ffd69115280 + 0x43 System.Runtime.CompilerServices.AsyncTaskMethodBuilder.SetResult()), calling (MethodDesc 00007ffd691043d0 + 0 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib]].SetExistingTaskResult(System.Threading.Tasks.VoidTaskResult))
000000b9ec6fe1f0 00007ffda4de9880 ntdll!RtlSetLastWin32Error + 0x40, calling ntdll!_security_check_cookie
000000b9ec6fe210 00007ffd6a262e86 coreclr!Thread::GetFinalApartment + 0x12 [e:\a\_work\725\s\src\vm\threads.cpp:5129], calling coreclr!Thread::GetApartment [e:\a\_work\725\s\src\vm\threads.cpp:4981]
000000b9ec6fe240 00007ffd6a262b42 coreclr!Thread::DoAppropriateWaitWorker + 0x302 [e:\a\_work\725\s\src\vm\threads.cpp:3633], calling KERNEL32!WaitForMultipleObjectsEx
000000b9ec6fe350 00007ffd6a262805 coreclr!Thread::DoAppropriateWait + 0x89 [e:\a\_work\725\s\src\vm\threads.cpp:3307], calling coreclr!Thread::DoAppropriateWaitWorker [e:\a\_work\725\s\src\vm\threads.cpp:3555]
000000b9ec6fe3d0 00007ffd6a262689 coreclr!SyncBlock::Wait + 0x1e9 [e:\a\_work\725\s\src\vm\syncblk.cpp:3378], calling coreclr!Thread::DoAppropriateWait [e:\a\_work\725\s\src\vm\threads.cpp:3281]
000000b9ec6fe510 00007ffd6a26243b coreclr!ObjectNative::WaitTimeout + 0xeb [e:\a\_work\725\s\src\classlibnative\bcltype\objectnative.cpp:297], calling coreclr!SyncBlock::Wait [e:\a\_work\725\s\src\vm\syncblk.cpp:3281]
000000b9ec6fe5b0 00007ffd6a2ab988 coreclr!ThreadNative::YieldThread + 0x38 [e:\a\_work\725\s\src\vm\comsynchronizable.cpp:1680], calling KERNEL32!SwitchToThreadStub
000000b9ec6fe600 00007ffd69679651 (MethodDesc 00007ffd691029d0 + 0x41 Internal.Runtime.Augments.RuntimeThread.get_OptimalMaxSpinWaitsPerSpinIteration()), calling coreclr!JIT_GetSharedNonGCStaticBaseNoCtor_SingleAppDomain [E:\A\_work\725\s\src\vm\amd64\JitHelpers_SingleAppDomain.asm:39]
000000b9ec6fe648 00007ffd6a2623d8 coreclr!ObjectNative::WaitTimeout + 0x88 [e:\a\_work\725\s\src\classlibnative\bcltype\objectnative.cpp:289], calling coreclr!LazyMachStateCaptureState [E:\A\_work\725\s\src\vm\amd64\getstate.asm:62]
000000b9ec6fe6a0 00007ffd6968838d (MethodDesc 00007ffd691045b0 + 0x8d System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken)), calling 00007ffd6a262350 (stub for System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object))
000000b9ec6fe6b0 00007ffd69669f41 (MethodDesc 00007ffd69100f28 + 0xb1 System.Threading.SpinWait.SpinOnce(Int32)), calling 00007ffd6a2a70d0 (stub for System.Threading.Thread.SpinWaitInternal(Int32))
000000b9ec6fe6f0 00007ffd6968811d (MethodDesc 00007ffd691045a0 + 0x18d System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken)), calling (MethodDesc 00007ffd691045b0 + 0 System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken))
000000b9ec6fe700 00007ffd0ac0e8ab (MethodDesc 00007ffd0ada08b0 + 0x4b Elastic.Apm.Logging.ScopedLogger.IsEnabled(Elastic.Apm.Logging.LogLevel)), calling 00007ffd0a6c2cc0
000000b9ec6fe790 00007ffd68af95f2 (MethodDesc 00007ffd0ae0f510 + 0x102 System.Collections.Concurrent.BlockingCollection`1[[System.__Canon, System.Private.CoreLib]].TryTakeWithNoTimeValidation(System.__Canon ByRef, Int32, System.Threading.CancellationToken, System.Threading.CancellationTokenSource)), calling (MethodDesc 00007ffd691045a0 + 0 System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken))
000000b9ec6fe7d0 00007ffd68af89ea (MethodDesc 00007ffd0ae0f440 + 0x1a System.Collections.Concurrent.BlockingCollection`1[[System.__Canon, System.Private.CoreLib]].get_IsCompleted()), calling (MethodDesc 00007ffd0ae0f658 + 0 System.Collections.Concurrent.BlockingCollection`1[[System.__Canon, System.Private.CoreLib]].CheckDisposed())
000000b9ec6fe800 00007ffd68afb1d7 (MethodDesc 00007ffd0ae271a0 + 0x107 System.Collections.Concurrent.BlockingCollection`1+<GetConsumingEnumerable>d__68[[System.__Canon, System.Private.CoreLib]].MoveNext()), calling (MethodDesc 00007ffd0ae0f510 + 0 System.Collections.Concurrent.BlockingCollection`1[[System.__Canon, System.Private.CoreLib]].TryTakeWithNoTimeValidation(System.__Canon ByRef, Int32, System.Threading.CancellationToken, System.Threading.CancellationTokenSource))
000000b9ec6fe860 00007ffd0ac25070 (MethodDesc 00007ffd0adddda8 + 0x2e0 Elastic.Apm.Helpers.SingleThreadTaskScheduler.<ThreadMain>b__14_0()), calling 00007ffd0a6c3060
000000b9ec6fe9e0 00007ffd0ac24842 (MethodDesc 00007ffd0ae21268 + 0x172 Elastic.Apm.Helpers.ExceptionUtils.DoSwallowingExceptions(Elastic.Apm.Logging.IApmLogger, System.Action, Boolean, System.String))
000000b9ec6fec00 00007ffd0ac22e3c (MethodDesc 00007ffd0adddd58 + 0xbc Elastic.Apm.Helpers.SingleThreadTaskScheduler.ThreadMain()), calling (MethodDesc 00007ffd0ae21268 + 0 Elastic.Apm.Helpers.ExceptionUtils.DoSwallowingExceptions(Elastic.Apm.Logging.IApmLogger, System.Action, Boolean, System.String))
000000b9ec6fec70 00007ffd68177060 (MethodDesc 00007ffd0a9cc5b0 + 0x40 System.Threading.Thread.ThreadMain_ThreadStart())
000000b9ec6feca0 00007ffd6958d409 (MethodDesc 00007ffd690e2f90 + 0x89 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
000000b9ec6fed20 00007ffd6a2e3053 coreclr!CallDescrWorkerInternal + 0x83 [E:\A\_work\725\s\src\vm\amd64\CallDescrWorkerAMD64.asm:101]
000000b9ec6fed60 00007ffd6a1d2763 coreclr!MethodDescCallSite::CallTargetWorker + 0x273 [e:\a\_work\725\s\src\vm\callhelpers.cpp:628], calling coreclr!CallDescrWorkerInternal [E:\A\_work\725\s\src\vm\amd64\CallDescrWorkerAMD64.asm:38]
000000b9ec6fee00 00007ffd6a1d27f4 coreclr!ArgIteratorTemplate<ArgIteratorBase>::HasRetBuffArg + 0x2c [e:\a\_work\725\s\src\vm\callingconvention.h:330], calling coreclr!MetaSig::GetReturnTypeNormalized [e:\a\_work\725\s\src\vm\siginfo.cpp:5213]
000000b9ec6fee10 00007ffd6a1f4ca3 coreclr!MethodDesc::GetCallTarget + 0x6b [e:\a\_work\725\s\src\vm\method.cpp:2225], calling coreclr!MethodDesc::GetSingleCallableAddrOfVirtualizedCode [e:\a\_work\725\s\src\vm\method.cpp:1980]
000000b9ec6feea0 00007ffd6a2d18e1 coreclr!ThreadNative::KickOffThread_Worker + 0x131 [e:\a\_work\725\s\src\vm\comsynchronizable.cpp:260], calling coreclr!MethodDescCallSite::CallTargetWorker [e:\a\_work\725\s\src\vm\callhelpers.cpp:333]
000000b9ec6feed0 00007ffda3cc7cca combase!TLSAddToMap + 0xa2 [onecore\com\combase\class\tls.cxx:180], calling ntdll!RtlLeaveCriticalSection
000000b9ec6ff010 00007ffd6a1d2332 coreclr!ManagedThreadBase_DispatchMiddle + 0x176 [e:\a\_work\725\s\src\vm\threads.cpp:8901], calling ntdll!LdrpDispatchUserCallTarget
000000b9ec6ff030 00007ffda3cdc0e6 combase!CoInitializeEx + 0x36 [onecore\com\combase\class\compobj.cxx:3949], calling combase!_CoInitializeEx [onecore\com\combase\class\compobj.cxx:3708]
000000b9ec6ff0c0 00007ffda1217c4b ucrtbase!controlfp_s + 0x5b, calling ucrtbase!common_control87
000000b9ec6ff0d0 00007ffd6a20bf43 coreclr!Thread::SetDebugCantStop + 0x2b [e:\a\_work\725\s\src\vm\threads.cpp:7616], calling coreclr!ClrFlsIncrementValue [e:\a\_work\725\s\src\inc\clrhost.h:87]
000000b9ec6ff130 00007ffd6a20f1b3 coreclr!ETW::ThreadLog::FireThreadCreated + 0x43 [e:\a\_work\725\s\src\vm\eventtrace.cpp:2655], calling coreclr!FireEtwThreadCreated [e:\a\_work\725\s\bin\obj\windows_nt.x64.release\src\inc\clretwallmain.h:1326]
000000b9ec6ff150 00007ffd6a1d2127 coreclr!ManagedThreadBase_DispatchOuter + 0xaf [e:\a\_work\725\s\src\vm\threads.cpp:9140], calling coreclr!ManagedThreadBase_DispatchMiddle [e:\a\_work\725\s\src\vm\threads.cpp:8855]
000000b9ec6ff170 00007ffd6a20f112 coreclr!Thread::HasStarted + 0xfa [e:\a\_work\725\s\src\vm\threads.cpp:1991], calling coreclr!CLRException::HandlerState::CleanupTry [e:\a\_work\725\s\src\vm\clrex.cpp:901]
000000b9ec6ff1f0 00007ffd6a2dbf5f coreclr!ManagedThreadBase_FullTransitionWithAD + 0x2f [e:\a\_work\725\s\src\vm\threads.cpp:9200], calling coreclr!ManagedThreadBase_DispatchOuter [e:\a\_work\725\s\src\vm\threads.cpp:9092]
000000b9ec6ff250 00007ffd6a20acdc coreclr!ThreadNative::KickOffThread + 0xfc [e:\a\_work\725\s\src\vm\comsynchronizable.cpp:380], calling coreclr!ManagedThreadBase_FullTransitionWithAD [e:\a\_work\725\s\src\vm\threads.cpp:9189]
000000b9ec6ff330 00007ffd6a20abba coreclr!Thread::intermediateThreadProc + 0x8a [e:\a\_work\725\s\src\vm\threads.cpp:2255], calling ntdll!LdrpDispatchUserCallTarget
000000b9ec6ff730 00007ffd6a20ab97 coreclr!Thread::intermediateThreadProc + 0x67 [e:\a\_work\725\s\src\vm\threads.cpp:2250], calling coreclr!__chkstk [d:\agent\_work\2\s\src\vctools\crt\vcstartup\src\misc\amd64\chkstk.asm:67]
000000b9ec6ff770 00007ffda4257974 KERNEL32!BaseThreadInitThunk + 0x14, calling ntdll!LdrpDispatchUserCallTarget
000000b9ec6ff7a0 00007ffda4dfa271 ntdll!RtlUserThreadStart + 0x21, calling ntdll!LdrpDispatchUserCallTarget
---------------------------------------------
Thread  12
Current frame: ntdll!NtDelayExecution + 0x14
Child-SP         RetAddr          Caller, Callee
000000b9ec73faa0 00007ffda15496f1 KERNELBASE!SleepEx + 0xa1, calling ntdll!NtDelayExecution
000000b9ec73fb10 00007ffda1549764 KERNELBASE!SleepEx + 0x114, calling ntdll!RtlActivateActivationContextUnsafeFast
000000b9ec73fb40 00007ffd6a2b7c8d coreclr!ThreadpoolMgr::TimerThreadFire + 0x5d [e:\a\_work\725\s\src\vm\win32threadpool.cpp:4641], calling KERNEL32!SleepEx
000000b9ec73fbe0 00007ffd6a2b7c2c coreclr!ThreadpoolMgr::TimerThreadStart + 0x6c [e:\a\_work\725\s\src\vm\win32threadpool.cpp:4613], calling coreclr!ThreadpoolMgr::TimerThreadFire [e:\a\_work\725\s\src\vm\win32threadpool.cpp:4628]
000000b9ec73fc10 00007ffda4257974 KERNEL32!BaseThreadInitThunk + 0x14, calling ntdll!LdrpDispatchUserCallTarget
000000b9ec73fc40 00007ffda4dfa271 ntdll!RtlUserThreadStart + 0x21, calling ntdll!LdrpDispatchUserCallTarget
---------------------------------------------
Thread  16
Current frame: ntdll!NtWaitForMultipleObjects + 0x14
Child-SP         RetAddr          Caller, Callee
000000b9eca3e1d0 00007ffda156c74e KERNELBASE!WaitForMultipleObjectsEx + 0xfe, calling ntdll!NtWaitForMultipleObjects
000000b9eca3e1e0 00007ffd6a25062e coreclr!GCHandleStore::CreateHandleOfType + 0x12e [e:\a\_work\725\s\src\gc\gchandletable.cpp:32], calling coreclr!StressLog::LogOn [e:\a\_work\725\s\src\utilcode\stresslog.cpp:624]
000000b9eca3e270 00007ffda156c840 KERNELBASE!WaitForMultipleObjectsEx + 0x1f0, calling ntdll!RtlActivateActivationContextUnsafeFast
000000b9eca3e288 00007ffda1549159 KERNELBASE!RaiseException + 0x69, calling ntdll!RtlRaiseException
000000b9eca3e2c0 00007ffd6958d409 (MethodDesc 00007ffd690e2f90 + 0x89 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
000000b9eca3e300 00007ffd6a23d175 coreclr!GetHRFromThrowable + 0x15 [e:\a\_work\725\s\src\vm\excep.cpp:2846], calling coreclr!IsException [e:\a\_work\725\s\src\vm\excep.cpp:8856]
000000b9eca3e330 00007ffd6a23cfb2 coreclr!RaiseTheExceptionInternalOnly + 0x28e [e:\a\_work\725\s\src\vm\excep.cpp:2996], calling KERNEL32!RaiseExceptionStub
000000b9eca3e3a0 00007ffd69640ca3 (MethodDesc 00007ffd690fc200 + 0x153 System.Threading.Tasks.Task.FinishStageTwo()), calling coreclr!JIT_GetSharedNonGCStaticBase_SingleAppDomain [E:\A\_work\725\s\src\vm\amd64\JitHelpers_SingleAppDomain.asm:26]
000000b9eca3e3c0 00007ffd6a1cfca9 coreclr!AllocatePrimitiveArray + 0x29 [e:\a\_work\725\s\src\vm\gchelpers.cpp:733], calling coreclr!TypeDesc::GetMethodTable [e:\a\_work\725\s\src\vm\typedesc.inl:18]
000000b9eca3e460 00007ffd6a23c92e coreclr!IL_Throw + 0x12e [e:\a\_work\725\s\src\vm\jithelpers.cpp:4860], calling coreclr!RaiseTheExceptionInternalOnly [e:\a\_work\725\s\src\vm\excep.cpp:2856]
000000b9eca3e490 00007ffd69641174 (MethodDesc 00007ffd690fc270 + 0x84 System.Threading.Tasks.Task.ExecuteEntry()), calling (MethodDesc 00007ffd690fc2a0 + 0 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef))
000000b9eca3e4a0 00007ffd6a262e86 coreclr!Thread::GetFinalApartment + 0x12 [e:\a\_work\725\s\src\vm\threads.cpp:5129], calling coreclr!Thread::GetApartment [e:\a\_work\725\s\src\vm\threads.cpp:4981]
000000b9eca3e4d0 00007ffd6a262b42 coreclr!Thread::DoAppropriateWaitWorker + 0x302 [e:\a\_work\725\s\src\vm\threads.cpp:3633], calling KERNEL32!WaitForMultipleObjectsEx
000000b9eca3e5e0 00007ffd6a262805 coreclr!Thread::DoAppropriateWait + 0x89 [e:\a\_work\725\s\src\vm\threads.cpp:3307], calling coreclr!Thread::DoAppropriateWaitWorker [e:\a\_work\725\s\src\vm\threads.cpp:3555]
000000b9eca3e660 00007ffd6a262689 coreclr!SyncBlock::Wait + 0x1e9 [e:\a\_work\725\s\src\vm\syncblk.cpp:3378], calling coreclr!Thread::DoAppropriateWait [e:\a\_work\725\s\src\vm\threads.cpp:3281]
000000b9eca3e7a0 00007ffd6a26243b coreclr!ObjectNative::WaitTimeout + 0xeb [e:\a\_work\725\s\src\classlibnative\bcltype\objectnative.cpp:297], calling coreclr!SyncBlock::Wait [e:\a\_work\725\s\src\vm\syncblk.cpp:3281]
000000b9eca3e840 00007ffd6a2ab988 coreclr!ThreadNative::YieldThread + 0x38 [e:\a\_work\725\s\src\vm\comsynchronizable.cpp:1680], calling KERNEL32!SwitchToThreadStub
000000b9eca3e890 00007ffd69679651 (MethodDesc 00007ffd691029d0 + 0x41 Internal.Runtime.Augments.RuntimeThread.get_OptimalMaxSpinWaitsPerSpinIteration()), calling coreclr!JIT_GetSharedNonGCStaticBaseNoCtor_SingleAppDomain [E:\A\_work\725\s\src\vm\amd64\JitHelpers_SingleAppDomain.asm:39]
000000b9eca3e8d8 00007ffd6a2623d8 coreclr!ObjectNative::WaitTimeout + 0x88 [e:\a\_work\725\s\src\classlibnative\bcltype\objectnative.cpp:289], calling coreclr!LazyMachStateCaptureState [E:\A\_work\725\s\src\vm\amd64\getstate.asm:62]
000000b9eca3e930 00007ffd6968838d (MethodDesc 00007ffd691045b0 + 0x8d System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken)), calling 00007ffd6a262350 (stub for System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object))
000000b9eca3e940 00007ffd69669f41 (MethodDesc 00007ffd69100f28 + 0xb1 System.Threading.SpinWait.SpinOnce(Int32)), calling 00007ffd6a2a70d0 (stub for System.Threading.Thread.SpinWaitInternal(Int32))
000000b9eca3e980 00007ffd6968811d (MethodDesc 00007ffd691045a0 + 0x18d System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken)), calling (MethodDesc 00007ffd691045b0 + 0 System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken))
000000b9eca3e990 00007ffd0ac0e8ab (MethodDesc 00007ffd0ada08b0 + 0x4b Elastic.Apm.Logging.ScopedLogger.IsEnabled(Elastic.Apm.Logging.LogLevel)), calling 00007ffd0a6c2cc0
000000b9eca3ea20 00007ffd68af95f2 (MethodDesc 00007ffd0ae0f510 + 0x102 System.Collections.Concurrent.BlockingCollection`1[[System.__Canon, System.Private.CoreLib]].TryTakeWithNoTimeValidation(System.__Canon ByRef, Int32, System.Threading.CancellationToken, System.Threading.CancellationTokenSource)), calling (MethodDesc 00007ffd691045a0 + 0 System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken))
000000b9eca3ea60 00007ffd68af89ea (MethodDesc 00007ffd0ae0f440 + 0x1a System.Collections.Concurrent.BlockingCollection`1[[System.__Canon, System.Private.CoreLib]].get_IsCompleted()), calling (MethodDesc 00007ffd0ae0f658 + 0 System.Collections.Concurrent.BlockingCollection`1[[System.__Canon, System.Private.CoreLib]].CheckDisposed())
000000b9eca3ea90 00007ffd68afb1d7 (MethodDesc 00007ffd0ae271a0 + 0x107 System.Collections.Concurrent.BlockingCollection`1+<GetConsumingEnumerable>d__68[[System.__Canon, System.Private.CoreLib]].MoveNext()), calling (MethodDesc 00007ffd0ae0f510 + 0 System.Collections.Concurrent.BlockingCollection`1[[System.__Canon, System.Private.CoreLib]].TryTakeWithNoTimeValidation(System.__Canon ByRef, Int32, System.Threading.CancellationToken, System.Threading.CancellationTokenSource))
000000b9eca3eaf0 00007ffd0ac25070 (MethodDesc 00007ffd0adddda8 + 0x2e0 Elastic.Apm.Helpers.SingleThreadTaskScheduler.<ThreadMain>b__14_0()), calling 00007ffd0a6c3060
000000b9eca3ec70 00007ffd0ac24842 (MethodDesc 00007ffd0ae21268 + 0x172 Elastic.Apm.Helpers.ExceptionUtils.DoSwallowingExceptions(Elastic.Apm.Logging.IApmLogger, System.Action, Boolean, System.String))
000000b9eca3ee90 00007ffd0ac22e3c (MethodDesc 00007ffd0adddd58 + 0xbc Elastic.Apm.Helpers.SingleThreadTaskScheduler.ThreadMain()), calling (MethodDesc 00007ffd0ae21268 + 0 Elastic.Apm.Helpers.ExceptionUtils.DoSwallowingExceptions(Elastic.Apm.Logging.IApmLogger, System.Action, Boolean, System.String))
000000b9eca3ef00 00007ffd68177060 (MethodDesc 00007ffd0a9cc5b0 + 0x40 System.Threading.Thread.ThreadMain_ThreadStart())
000000b9eca3ef30 00007ffd6958d409 (MethodDesc 00007ffd690e2f90 + 0x89 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object))
000000b9eca3efb0 00007ffd6a2e3053 coreclr!CallDescrWorkerInternal + 0x83 [E:\A\_work\725\s\src\vm\amd64\CallDescrWorkerAMD64.asm:101]
000000b9eca3eff0 00007ffd6a1d2763 coreclr!MethodDescCallSite::CallTargetWorker + 0x273 [e:\a\_work\725\s\src\vm\callhelpers.cpp:628], calling coreclr!CallDescrWorkerInternal [E:\A\_work\725\s\src\vm\amd64\CallDescrWorkerAMD64.asm:38]
000000b9eca3f090 00007ffd6a1d27f4 coreclr!ArgIteratorTemplate<ArgIteratorBase>::HasRetBuffArg + 0x2c [e:\a\_work\725\s\src\vm\callingconvention.h:330], calling coreclr!MetaSig::GetReturnTypeNormalized [e:\a\_work\725\s\src\vm\siginfo.cpp:5213]
000000b9eca3f0a0 00007ffd6a1f4ca3 coreclr!MethodDesc::GetCallTarget + 0x6b [e:\a\_work\725\s\src\vm\method.cpp:2225], calling coreclr!MethodDesc::GetSingleCallableAddrOfVirtualizedCode [e:\a\_work\725\s\src\vm\method.cpp:1980]
000000b9eca3f130 00007ffd6a2d18e1 coreclr!ThreadNative::KickOffThread_Worker + 0x131 [e:\a\_work\725\s\src\vm\comsynchronizable.cpp:260], calling coreclr!MethodDescCallSite::CallTargetWorker [e:\a\_work\725\s\src\vm\callhelpers.cpp:333]
000000b9eca3f160 00007ffda3cc7cca combase!TLSAddToMap + 0xa2 [onecore\com\combase\class\tls.cxx:180], calling ntdll!RtlLeaveCriticalSection
000000b9eca3f2a0 00007ffd6a1d2332 coreclr!ManagedThreadBase_DispatchMiddle + 0x176 [e:\a\_work\725\s\src\vm\threads.cpp:8901], calling ntdll!LdrpDispatchUserCallTarget
000000b9eca3f2c0 00007ffda3cdc0e6 combase!CoInitializeEx + 0x36 [onecore\com\combase\class\compobj.cxx:3949], calling combase!_CoInitializeEx [onecore\com\combase\class\compobj.cxx:3708]
000000b9eca3f350 00007ffda1217c4b ucrtbase!controlfp_s + 0x5b, calling ucrtbase!common_control87
000000b9eca3f360 00007ffd6a20bf43 coreclr!Thread::SetDebugCantStop + 0x2b [e:\a\_work\725\s\src\vm\threads.cpp:7616], calling coreclr!ClrFlsIncrementValue [e:\a\_work\725\s\src\inc\clrhost.h:87]
000000b9eca3f3c0 00007ffd6a20f1b3 coreclr!ETW::ThreadLog::FireThreadCreated + 0x43 [e:\a\_work\725\s\src\vm\eventtrace.cpp:2655], calling coreclr!FireEtwThreadCreated [e:\a\_work\725\s\bin\obj\windows_nt.x64.release\src\inc\clretwallmain.h:1326]
000000b9eca3f3e0 00007ffd6a1d2127 coreclr!ManagedThreadBase_DispatchOuter + 0xaf [e:\a\_work\725\s\src\vm\threads.cpp:9140], calling coreclr!ManagedThreadBase_DispatchMiddle [e:\a\_work\725\s\src\vm\threads.cpp:8855]
000000b9eca3f400 00007ffd6a20f112 coreclr!Thread::HasStarted + 0xfa [e:\a\_work\725\s\src\vm\threads.cpp:1991], calling coreclr!CLRException::HandlerState::CleanupTry [e:\a\_work\725\s\src\vm\clrex.cpp:901]
000000b9eca3f480 00007ffd6a2dbf5f coreclr!ManagedThreadBase_FullTransitionWithAD + 0x2f [e:\a\_work\725\s\src\vm\threads.cpp:9200], calling coreclr!ManagedThreadBase_DispatchOuter [e:\a\_work\725\s\src\vm\threads.cpp:9092]
000000b9eca3f4e0 00007ffd6a20acdc coreclr!ThreadNative::KickOffThread + 0xfc [e:\a\_work\725\s\src\vm\comsynchronizable.cpp:380], calling coreclr!ManagedThreadBase_FullTransitionWithAD [e:\a\_work\725\s\src\vm\threads.cpp:9189]
000000b9eca3f510 00007ffda4e05782 ntdll!_LdrpInitialize + 0xa6, calling ntdll!NtTestAlert
000000b9eca3f5c0 00007ffd6a20abba coreclr!Thread::intermediateThreadProc + 0x8a [e:\a\_work\725\s\src\vm\threads.cpp:2255], calling ntdll!LdrpDispatchUserCallTarget
000000b9eca3fa40 00007ffd6a20ab97 coreclr!Thread::intermediateThreadProc + 0x67 [e:\a\_work\725\s\src\vm\threads.cpp:2250], calling coreclr!__chkstk [d:\agent\_work\2\s\src\vctools\crt\vcstartup\src\misc\amd64\chkstk.asm:67]
000000b9eca3fa80 00007ffda4257974 KERNEL32!BaseThreadInitThunk + 0x14, calling ntdll!LdrpDispatchUserCallTarget
000000b9eca3fab0 00007ffda4dfa271 ntdll!RtlUserThreadStart + 0x21, calling ntdll!LdrpDispatchUserCallTarget
---------------------------------------------
Thread   9
Current frame: ntdll!NtWaitForWorkViaWorkerFactory + 0x14
Child-SP         RetAddr          Caller, Callee
000000b9ebf7f6f0 00007ffda4de01f4 ntdll!TppWorkerThread + 0x274, calling ntdll!NtWaitForWorkViaWorkerFactory
000000b9ebf7f9e0 00007ffda4257974 KERNEL32!BaseThreadInitThunk + 0x14, calling ntdll!LdrpDispatchUserCallTarget
000000b9ebf7fa10 00007ffda4dfa271 ntdll!RtlUserThreadStart + 0x21, calling ntdll!LdrpDispatchUserCallTarget
---------------------------------------------

Just search for Elastic.Apm - there should be 2 threads containing frames from that namespace. One is the thread that fetches central config data (less important for our discussion), the other one is the one that sends data to APM server (my theory is that this one dies - or is not started after an AppPool recycle). Unfortunately I guess you won't be able to easily distinguish the 2 (that's due how our code is structured 🤦‍♂ ), but both work the same, so I guess either both will die or none of them.

So, could you just dump the threads after the agent stops sending data and see if those threads are still there - or I should say verify that they died. It'd be nice to first dump the threads while the agent works and make sure you can spot the 2 agent threads - for this you need to have a proper config with symbols configured.

gregkalapos avatar Feb 06 '20 11:02 gregkalapos

@gregkalapos First of all, many thanks for such a detailed way of providing thread-dumps, etc. First a dump when it was still working; threaddump_workingapm.txt

Then a dump when it stopped posting to APM (after an APPPOOL recycle): threaddump_nonworkingapm.txt

This still shows threads with details of Elastic.Apm in it, but instead of 15 rows that contains Elastic.Apm, it had 14 matching rows after the recycle in our case.

I hope the provided information is helpful. In case you need more information, please let me know. Thanks in advance!

Cerebellum90 avatar Feb 06 '20 19:02 Cerebellum90

@gregkalapos hope you're doing well! Any new information regarding our issues? We're still experiencing restart-problems of the APM agent after a publish of our website. The only way to bring the APM logging back to life, is to recycle/(stop/start) the apppool for a random amount of times until it starts agaain

Cerebellum90 avatar Mar 04 '20 08:03 Cerebellum90

We are experiencing this problem with several .Net apps, is anyone looking into this?

cjensen1-godaddy avatar May 12 '22 20:05 cjensen1-godaddy

I'm looking into this.

In version 1.15.0 we made some change around threading in the component that sends data to APM Server.

This is a fairly old issue - could someone comment if they have the same bug with 1.15.0 or newer? Just want to make sure the problem is still there after changes related to this area.

I'm looking at the useful threaddumps. If someone could attach agent logs that'd help as well.

Investigation continues...

gregkalapos avatar Jul 04 '22 10:07 gregkalapos

Hi @gregkalapos, for us the bug seems to be gone for a while now. I guess since 5 releases or so. Also, our application has had some refactoring and the servers have been upgraded in the meantime. Our guess is that is was more of a resource/timing related bug in combination with APM, instead of actually what version of the APM library.

Cerebellum90 avatar Jul 05 '22 06:07 Cerebellum90

We'll upgrade to the latest and try it out. The issue happens after some time of the application being up, and we hypothesize, a temporary network issue causes the APM client connection to the server to get into a bad state, so only time will tell if it is fixed. In other words, we don't have a solid repro. We can also experiment with attempting to force a network interruption in a sandbox environment, but not sure if that will work.

cjensen1-godaddy avatar Jul 05 '22 20:07 cjensen1-godaddy

Thanks for the useful updates!

we hypothesize, a temporary network issue causes the APM client connection to the server to get into a bad state

The agent uses HttpClient and we handle all sorts of exceptions. Of course we can't rule this out and it's definitely good to check this. On the other hand, in this case, I'd also expect other components using HttpClient to fail. Definitely don't want to state any conclusion, just adding my thoughts on this.

gregkalapos avatar Jul 05 '22 21:07 gregkalapos

Yeah, we could be wrong. Circumstantially, one of the instances coincided with the temporary failure of a network interface that only applied to outbound traffic. The application came back with the exception of the APM client connection after the controller issue was resolved.

cjensen1-godaddy avatar Jul 05 '22 21:07 cjensen1-godaddy

We now have all our applications on 1.16.1 and will monitor to see whether the issue recurs.

cjensen1-godaddy avatar Jul 12 '22 00:07 cjensen1-godaddy

I think this is fixed - if not feel free to hit "repoen" and let us know the details.

gregkalapos avatar Sep 29 '22 09:09 gregkalapos