cpp_client_telemetry

ResumeTransmission stuck on lock

Open thomasameisel opened this issue 3 years ago • 11 comments

Describe your environment. Describe any aspect of your environment relevant to the problem, including your SDK version, platform, OS version, etc. If you're reporting a problem with a specific version of a library in this repo, please check whether the problem has been fixed on the main branch.

iOS platform, SDK version 3.6.187

Steps to reproduce. Describe exactly how to reproduce the error. Include a code sample if applicable.

Call ODWLogManager.ResumeTransmission. The issue was reported as happening on boot, but it is unclear whether that is required to reproduce it.

What is the expected behavior? What did you expect to see?

ResumeTransmission executes successfully.

What is the actual behavior? What did you see instead?

ResumeTransmission waits for a lock to be released until the app is killed as non-responsive.

Additional context. Add any other context about the problem here.

Stack trace:

AC24C470-6473-4D8C-8A96-18E7BC0D03C9

thomasameisel, Nov 28 '22 19:11

@thomasameisel Do you have the stack traces for all the other threads at the time Thread1 was waiting for the lock?

lalitb, Nov 30 '22 05:11

@lalitb Unfortunately, we don't have the stack traces for the other threads.

thomasameisel, Nov 30 '22 19:11

Thanks @thomasameisel. The stacks of the other threads would have given more insight into any deadlock, or shown whether another thread had invoked a LogManager operation that was taking too long.

lalitb, Nov 30 '22 21:11

@lalitb We are facing a similar lock issue on pauseTransmission.

Thread 41 triggered +[ODWLogManager pauseTransmission] and is stuck in the 1DS lock, probably for a long time.

Screenshot 2023-02-15 at 11 41 39 AM

Attaching the crash reports here.

report-2517258755120939999-2c2491df-77e7-4a01-9e0b-15b3ee6faef7.txt

TeamSpaceApp 2-9-23, 1-24 PM.txt

When the PauseTransmission request is dispatched, the call acquires the lock, waits for HTTP request cancellation, and never releases the lock.

Thread 38 name:   Dispatch queue: eventDispatchQueue
Thread 38:
0   libsystem_kernel.dylib        	       0x1c8cccdfc swtch_pri + 8
1   libsystem_pthread.dylib       	       0x1d943673c cthread_yield + 32
2   TeamSpaceApp                  	       0x10838b5f8 Microsoft::Applications::Events::HttpClientManager::cancelAllRequests() + 44
3   TeamSpaceApp                  	       0x1083def24 std::__1::__function::__func<Microsoft::Applications::Events::TelemetrySystem::TelemetrySystem(Microsoft::Applications::Events::ILogManager&, Microsoft::Applications::Events::IRuntimeConfig&, Microsoft::Applications::Events::IOfflineStorage&, Microsoft::Applications::Events::IHttpClient&, Microsoft::Applications::Events::ITaskDispatcher&, Microsoft::Applications::Events::IBandwidthController*, Microsoft::Applications::Events::LogSessionDataProvider&)::$_2, std::__1::allocator<Microsoft::Applications::Events::TelemetrySystem::TelemetrySystem(Microsoft::Applications::Events::ILogManager&, Microsoft::Applications::Events::IRuntimeConfig&, Microsoft::Applications::Events::IOfflineStorage&, Microsoft::Applications::Events::IHttpClient&, Microsoft::Applications::Events::ITaskDispatcher&, Microsoft::Applications::Events::IBandwidthController*, Microsoft::Applications::Events::LogSessionDataProvider&)::$_2>, bool ()>::operator()() + 60
4   TeamSpaceApp                  	       0x1083a5afc Microsoft::Applications::Events::LogManagerImpl::PauseTransmission() + 128
5   TeamSpaceApp                  	       0x1083bb80c Microsoft::Applications::Events::LogManagerBase<Microsoft::Applications::Events::ModuleLogConfiguration>::PauseTransmission() + 84
6   TeamSpaceApp                  	       0x1083bb718 +[ODWLogManager pauseTransmission] + 20
7   TeamSpaceApp                  	       0x10a1d9294 TSOneDSTelemetryLogManager.pauseTransmission() + 256
8   TeamSpaceApp                  	       0x10a1d9348 @objc TSOneDSTelemetryLogManager.pauseTransmission() + 36
9   TeamSpaceApp                  	       0x1091e16a4 __46-[AXPInstrumentationManager pauseTransmission]_block_invoke + 136
10  TeamSpaceApp                  	       0x10c7bf56c 0x102b08000 + 164328812
11  libdispatch.dylib             	       0x1927cf460 _dispatch_call_block_and_release + 32
12  libdispatch.dylib             	       0x1927d0f88 _dispatch_client_callout + 20
13  libdispatch.dylib             	       0x1927d8640 _dispatch_lane_serial_drain + 672
14  libdispatch.dylib             	       0x1927d918c _dispatch_lane_invoke + 384
15  libdispatch.dylib             	       0x1927e3e10 _dispatch_workloop_worker_thread + 652
16  libsystem_pthread.dylib       	       0x1d9430df8 _pthread_wqthread + 288
17  libsystem_pthread.dylib       	       0x1d9430b98 start_wqthread + 8
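
The frames above (a thread yield inside cancelAllRequests, reached from PauseTransmission) are consistent with a loop that yields until the in-flight request count reaches zero while the caller still holds a lock. The following is a minimal sketch of that pattern, not the SDK's code: `FakeHttpManager`, `FakeLogManager`, `demo_pause`, and the timeout parameter are all hypothetical and exist only so the example terminates instead of hanging.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an HTTP manager; not the SDK's actual class.
struct FakeHttpManager {
    std::atomic<int> in_flight{0};

    // Yields until no requests remain or the deadline passes.
    // Returns true if all requests drained, false on timeout.
    // (The timeout is an assumption added so this sketch cannot hang.)
    bool cancel_all_requests(std::chrono::milliseconds timeout) {
        const auto deadline = std::chrono::steady_clock::now() + timeout;
        while (in_flight.load() > 0) {
            if (std::chrono::steady_clock::now() >= deadline) return false;
            std::this_thread::yield();
        }
        return true;
    }
};

// Hypothetical stand-in for a log manager guarded by one lock.
struct FakeLogManager {
    std::mutex lock;
    FakeHttpManager http;

    bool pause_transmission(std::chrono::milliseconds timeout) {
        std::lock_guard<std::mutex> guard(lock);  // held for the entire wait
        return http.cancel_all_requests(timeout);
    }
};

// Returns true when the drain completed and false when it timed out.
bool demo_pause(bool completion_arrives, std::chrono::milliseconds timeout) {
    FakeLogManager mgr;
    mgr.http.in_flight = 1;  // pretend one request is still running
    std::thread completer;
    if (completion_arrives) {
        completer = std::thread([&mgr] {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
            mgr.http.in_flight = 0;  // the request's completion callback fires
        });
    }
    const bool drained = mgr.pause_transmission(timeout);
    if (completer.joinable()) completer.join();
    return drained;
}
```

Calling `demo_pause(false, ...)` models the case where the completion never arrives: without the added timeout, the loop would spin indefinitely with the lock held, which would match a hang that ends only when the app is killed.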


nishchith-cp, Feb 15 '23 06:02

@lalitb Any updates on this? Could you please prioritize this? Let me know if you need anything else. Here is another crash log: TeamSpaceApp 3-1-23, 1-44 PM.txt

nishchith-cp, Mar 03 '23 04:03

@lalitb Any updates on this? We are running into this quite often. Could you please look into it?

nishchith-cp, Mar 10 '23 05:03

@nishchith-cp - Is it possible to get the stack traces of all the other threads, not just the thread crashing with the timeout? There is a deadlock scenario between threads, so that data would be helpful.

lalitb, Mar 10 '23 08:03

I already attached the crash log in the previous comment: TeamSpaceApp.2-9-23.1-24.PM (1).txt

nishchith-cp, Mar 10 '23 14:03

@lalitb Could you share an update on this?

nishchith-cp, Mar 27 '23 13:03

@lalitb here's a crash log with the PauseTransmission issue - report-2517068873866699999-59e56560-7cb3-4c83-9343-b9e8ff905328 (1).txt

From the call stack, I noticed that the PauseTransmission function synchronously waits for the HTTP requests to complete. Why does it need to wait for these network requests? While it waits for the requests to complete, PauseTransmission also keeps holding the m_lock mutex, which makes other functions (e.g. GetLogger) appear to hang, since they are waiting on that mutex.
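
To make the contention concrete, here is a small model of the two orderings: waiting for requests while the mutex is held, versus updating state under the mutex and waiting after releasing it. This is only a sketch under assumptions. `ModelManager`, its methods, and `logger_wait_ms` are hypothetical names, and whether the second ordering would be safe in the SDK depends on invariants that this thread does not establish.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;
using Ms = std::chrono::milliseconds;

// Hypothetical model; only the name m_lock is borrowed from the comment above.
struct ModelManager {
    std::mutex m_lock;
    std::atomic<bool> paused{false};  // written only while m_lock is held
    std::atomic<int> in_flight{1};    // pretend one upload is still running

    void wait_for_drain() {
        while (in_flight.load() > 0) std::this_thread::yield();
    }

    // Variant A: the mutex stays locked across the wait.
    void pause_holding_lock() {
        std::lock_guard<std::mutex> guard(m_lock);
        paused = true;
        wait_for_drain();
    }

    // Variant B: change state under the mutex, then wait without it.
    void pause_releasing_lock() {
        {
            std::lock_guard<std::mutex> guard(m_lock);
            paused = true;
        }
        wait_for_drain();
    }

    // Stand-in for a GetLogger-like call that needs the mutex only briefly.
    void get_logger() {
        std::lock_guard<std::mutex> guard(m_lock);
    }
};

// Measures how long get_logger() blocks while a pause is in progress
// and the pending upload needs `upload` to finish.
long long logger_wait_ms(bool hold_lock, Ms upload) {
    ModelManager mgr;
    std::thread pauser([&] {
        if (hold_lock) mgr.pause_holding_lock();
        else mgr.pause_releasing_lock();
    });
    // Wait until the pauser has taken the mutex at least once.
    while (!mgr.paused.load()) std::this_thread::yield();
    std::thread uploader([&] {
        std::this_thread::sleep_for(upload);
        mgr.in_flight = 0;  // the upload completes
    });
    const auto start = Clock::now();
    mgr.get_logger();
    const auto waited =
        std::chrono::duration_cast<Ms>(Clock::now() - start).count();
    pauser.join();
    uploader.join();
    return waited;
}
```

In this model, `logger_wait_ms(true, Ms(300))` blocks for roughly the duration of the upload, while `logger_wait_ms(false, Ms(300))` returns almost immediately, because the mutex is no longer held during the wait.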

thomasameisel, Oct 17 '23 18:10

@lalitb We are running into this issue again. It is causing hangs, and we are getting a lot of complaints. Could you please help prioritize this? We observe it mainly when a lot of telemetry events are queued.

nishchith-cp, Mar 20 '25 13:03