microsoft-graph-comms-samples icon indicating copy to clipboard operation
microsoft-graph-comms-samples copied to clipboard

Random Media Platform Overloaded status

Open rvleeuwen256 opened this issue 1 year ago • 3 comments

We have built a multi-tenant solution using the Media platform that handles roughly 10000 calls per day with audio and H.264 video. We are very pleased by the stability of the platform. However since the platform has moved from using WCF to HTTP (version 1.28) it reports the Overloaded status seemingly random multiple times per day. This is quite problematic because in this status the platform cannot handle new calls. Investigation showed that this overload is not related to high CPU or network traffic, because we have seen cases where the platform was overloaded while there was only a single active call. When we run a load-test on our solution we cannot trigger the overload. We have added additional logging in all callbacks from the MediaPlatform to ensure we handle the callbacks from the MediaPlatform very quickly (<10 ms) and we always return from the callback. The issue happens both when running on .Net Framework 4.7.2. as well as on .Net 8.

We have enabled the logging of the Media platform and we do see some disturbing error messages. There seems to be a relation between this random overload and the message 'thread <threadId> considered stuck or inactive'. Is it possible the development team of the Media platform can have a look at these logs?

Microsoft.Skype.Bots.Media Version 1.29.0.75-preview Microsoft.Graph.Communications.Calls.Media 1.2.0.10563 Microsoft.Graph 5.54.0 Microsoft.Graph.Core 3.1.11

rvleeuwen256 avatar Dec 12 '24 21:12 rvleeuwen256

@rvleeuwen256 Could you please share the following info?

  1. What kind of hardware/SKU are you running it on?
  2. What is your bot's application ID?
  3. Can you also share the MediaPlatform logs that you have currently enabled during the time it reports overloaded?

Depending on what we find from the logs you shared we may increase the logging verbosity to debug further at which point you may have to rerun your tests. Hope that's ok.

adityaramgopal avatar Dec 13 '24 17:12 adityaramgopal

Hi @rvleeuwen256 If you don't want to share the bot application ID, then a recent call ID would also work.

ssulzer avatar Dec 13 '24 19:12 ssulzer

We are indeed careful with providing the AppId of the Bot. The service is running on a Cloud Service Extended Support on many Standard_D3_v2 machines. We have scaled both horizontally as well as vertically, but this did not make any difference.

An example of a 2 overlapping calls that triggered the Media overload is the following:

Teams Call Id: 34003b80-ddc7-47df-8231-adc02eda3399 Call Chain Id: c41d5539-43e8-437c-8ee7-5fa6042aa21d Media session Id: 28054290-800c-400a-982b-e9fd972845d7 Start: 2024-12-12 16:34:57.982 Established: 2024-12-12 16:35:01.402 End: 2024-12-12 16:37:01.677

Teams Call Id: 64003a80-bda7-490d-9629-c9acb1346ce5 Call Chain Id: f91809e9-8d66-48da-bda9-6508274b1e18, Media session Id: e4e2d290-8593-4780-9383-c6c5c608e275 Start: 2024-12-12 16:36:29.249 Established: 16:36:31.586 End: 2024-12-12 16:37:37.276

At 2024-12-12 16:37:21.210: Media platform health changed from 'Normal' to 'HeavilyLoaded'. At 2024-12-12 16:38:04.206: Media platform health changed from 'HeavilyLoaded' to 'Overloaded'.

Interesting fact is that the overload happens while there was no active call anymore.

The Media platform logs are way too big (even with just Information) to share here on GitHub. These warnings were found in the Media platform logs after the call was finished.

4800121 2024-12-12 16:37:41.039 Warning [MediaPlatform] [AvMP][AppId:<AppId>][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0019ABD0 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.00110097 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 33 considered stuck or inactive 4800122 2024-12-12 16:37:41.039 Warning [MediaPlatform] [AvMP][AppId:<AppId>][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0019ABD1 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.00110098 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 41 considered stuck or inactive 4800123 2024-12-12 16:37:41.039 Warning [MediaPlatform] [AvMP][AppId:<AppId>][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0019ABD2 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.00110099 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 57 considered stuck or inactive 4800124 2024-12-12 16:37:41.039 Warning [MediaPlatform] [AvMP][AppId:<AppId>][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0019ABD3 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.239::12/12/2024-16:37:41.039.0011009A (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 65 considered stuck or inactive 4802246 2024-12-12 16:38:11.040 Warning [MediaPlatform] [AvMP][AppId:<AppId>][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.202::12/12/2024-16:38:11.040.0019AC16 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.202::12/12/2024-16:38:11.040.001100B6 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 62 considered stuck or inactive 4802247 2024-12-12 16:38:11.040 Warning [MediaPlatform] [AvMP][AppId:<AppId>][RD28187857DFEF] TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.202::12/12/2024-16:38:11.040.0019AC17 (MPAZAPPHOST,LogWarn:AzAppMPHostLogger.cs(96)) TL_WARN(TF_COMPONENT) [RD28187857DFEF]3972.202::12/12/2024-16:38:11.040.001100B7 (AVMP,HeartBeatCallback:workitemqueue.cs(162)) [MP] ThreadPool: thread 63 considered stuck or inactive

There are no errors or warnings that have the media session id in the logging. We do see many errors like 'GetAllChannelsQoe: context[4cbf23cc-b118-48a0-a95a-7f41a9423618] Endpoints not found'.

If you need more information please let us know.

rvleeuwen256 avatar Dec 16 '24 15:12 rvleeuwen256

After a long investigation we were able to narrow the issue down. This issue is starting to happen when the bot-process starts child-processes. It does not matter what kind of child-process is started and it also does not matter whether we redirect the input and output streams or not. We have filed a bug at Microsoft for further investigation.

rvleeuwen256 avatar Jul 08 '25 08:07 rvleeuwen256