Upgrade to version 8 causes processes that dotnet monitor is supposed to be monitoring to hang
Description
We recently upgraded our image from mcr.microsoft.com/dotnet/monitor:7 (I believe specifically mcr.microsoft.com/dotnet/monitor@sha256:0cda57851419ac60878166c29dd18f09915ca2047de60444847f91338db8a326) to mcr.microsoft.com/dotnet/monitor:8 (specifically mcr.microsoft.com/dotnet/monitor@sha256:9324ee5e07e7ce8428772aa6269438b3a6688649d1746f41190acb0a045ff857) in our production environment, and some of the dotnet processes running in the container that dotnet monitor is supposed to be monitoring started emitting logs of the form:
The runtime has been configured to pause during startup and is awaiting a Diagnostics IPC ResumeStartup command from a Diagnostic Port.
DOTNET_DiagnosticPorts="/dotnet-diagnostics/port.sock"
DOTNET_DefaultDiagnosticPortSuspend=0
When a process emitted that log it then just hung in the SNl state, and we terminated it with an automatic time-based cutoff (see the notes in Configuration: these C# processes are supposed to be very short lived, which I suspect is where the issue stems from).
At the same time there also seems to be a log message from the dotnet-monitor container saying:
{
"Timestamp": "2024-05-02T08:39:56.8474094Z",
"EventId": 99,
"LogLevel": "Warning",
"Category": "Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource",
"Message": "Unable to fully initialize endpoint for process 792382. Some monitoring features may be unavailable.",
"Exception": "System.AggregateException: One or more errors occurred. (Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.)\n ---> System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection's state is no longer correct.\n at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)\n at Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource.ResumeAndQueueEndpointInfo(ReversedDiagnosticsServer server, IpcEndpointInfo info, CancellationToken token)\n --- End of inner exception stack trace ---\n at Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource.ResumeAndQueueEndpointInfo(ReversedDiagnosticsServer server, IpcEndpointInfo info, CancellationToken token)\n at Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource.<>c__DisplayClass19_1.<<ListenAsync>b__0>d.MoveNext()",
"State": {
"Message": "Unable to fully initialize endpoint for process 792382. Some monitoring features may be unavailable.",
"processId": 792382,
"{OriginalFormat}": "Unable to fully initialize endpoint for process {processId}. Some monitoring features may be unavailable."
},
"Scopes": []
}
I am afraid I have not yet managed a minimal reproduction, as we saw this in production rather than in our staging environments, where load is much lower. I wanted to raise this first in case you are able to make progress without one!
We never saw this issue with version 7, so it came as quite the surprise when all of these processes started hanging 😄
Configuration
We run dotnet monitor in a Kubernetes cluster (v1.25) alongside a single other container. That container's main entry point is a long-lived process (it lives for the lifetime of the container), but it also spins up lots of dotnet sub-processes (used as a custom merge driver in git).
We only saw this problem in those sub-processes, which have been known to take seconds (or less) to execute, and it seems to be fairly random (i.e. most of those sub-processes were fine but a minority hung on this bug).
We already have a lot of log messages coming out of dotnet monitor to the tune of:
{"Timestamp":"2024-05-02T08:39:54.2599063Z","EventId":52,"LogLevel":"Warning","Category":"Microsoft.Diagnostics.Tools.Monitor.ServerEndpointInfoSource","Message":"Unexpected timeout from process 792203. Process will no longer be monitored.","State":{"Message":"Unexpected timeout from process 792203. Process will no longer be monitored.","processId":"792203","{OriginalFormat}":"Unexpected timeout from process {processId}. Process will no longer be monitored."},"Scopes":[]}
{"Timestamp":"2024-05-02T08:39:54.2600976Z","EventId":44,"LogLevel":"Information","Category":"Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService","Message":"Stopping collection rules.","State":{"Message":"Stopping collection rules.","{OriginalFormat}":"Stopping collection rules."},"Scopes":[{"Message":"TargetProcessId:792203 TargetRuntimeInstanceCookie:4d9379280ea940dd83d09bc58036e85f","TargetProcessId":"792203","TargetRuntimeInstanceCookie":"4d9379280ea940dd83d09bc58036e85f"}]}
{"Timestamp":"2024-05-02T08:39:54.2601133Z","EventId":45,"LogLevel":"Information","Category":"Microsoft.Diagnostics.Tools.Monitor.CollectionRules.CollectionRuleService","Message":"All collection rules have stopped.","State":{"Message":"All collection rules have stopped.","{OriginalFormat}":"All collection rules have stopped."},"Scopes":[{"Message":"TargetProcessId:792203 TargetRuntimeInstanceCookie:4d9379280ea940dd83d09bc58036e85f","TargetProcessId":"792203","TargetRuntimeInstanceCookie":"4d9379280ea940dd83d09bc58036e85f"}]}
This suggests we might already be pushing dotnet monitor to its limits.
Regression?
This previously worked with dotnet monitor version 7 (most recent)
Other information
The exception above seems to point to a concurrency issue, but I couldn't immediately figure it out, hence raising an issue rather than a PR.
We would actually be perfectly happy for dotnet monitor to not monitor these short lived processes, so if we could do that it would be excellent 😅
I am by no means an expert on the code, but at the very least the exception seems to point to https://github.com/dotnet/dotnet-monitor/blob/ab0774688b140dedcbca8be761b6ce677301251f/src/Tools/dotnet-monitor/EndpointInfo/ServerEndpointInfoSource.cs#L223
My guess is that the _activeEndpointServiceScopes dictionary is being accessed concurrently (and so should be a concurrent dictionary, or have its mutations synchronized). That dictionary doesn't seem to exist in the v7 code (it was introduced in https://github.com/dotnet/dotnet-monitor/pull/4964).
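To illustrate the kind of race I mean, here is a small self-contained sketch (a plain console app, not dotnet-monitor code): many tasks inserting into a plain Dictionary at once can corrupt its internal state and throw the same "Operations that change non-concurrent collections must have exclusive access" InvalidOperationException seen in the log above, whereas a ConcurrentDictionary tolerates concurrent writers.

// Sketch only: demonstrates a Dictionary race similar to the one I suspect above.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class DictionaryRaceSketch
{
    static async Task Main()
    {
        var plain = new Dictionary<int, string>();
        try
        {
            // Not thread-safe: concurrent inserts may throw (as in the log above)
            // or silently corrupt the dictionary's state.
            await Task.WhenAll(Enumerable.Range(0, 10_000)
                .Select(i => Task.Run(() => plain.Add(i, $"scope-{i}"))));
            Console.WriteLine($"Plain Dictionary happened to survive: {plain.Count} entries");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Plain Dictionary failed under concurrent writes: {ex.GetType().Name}: {ex.Message}");
        }

        // Thread-safe alternative: all inserts succeed without external locking.
        var concurrent = new ConcurrentDictionary<int, string>();
        await Task.WhenAll(Enumerable.Range(0, 10_000)
            .Select(i => Task.Run(() => concurrent.TryAdd(i, $"scope-{i}"))));
        Console.WriteLine($"ConcurrentDictionary: {concurrent.Count} entries");
    }
}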
Thanks for reporting this! We'll try to take a look and investigate with the information that you've provided. However, I don't have a timeline for when we can give you a definitive solution.
We would actually be perfectly happy for dotnet monitor to not monitor these short lived processes, so if we could do that it would be excellent 😅
You can accomplish this by clearing the DOTNET_DiagnosticPorts environment variable in your top-level .NET process (presumably the container entrypoint process) before launching the child processes e.g. Environment.SetEnvironmentVariable("DOTNET_DiagnosticPorts", null);. This will prevent the child processes from being observed by .NET Monitor in your current configuration but still allow the entrypoint process to be observed. This environment variable is only read by the runtime at startup, so clearing it anywhere in your managed code will not affect the observability of the process in which you clear it.
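For example, here is a minimal sketch of an entrypoint process that clears the variable before launching a child (the child executable path and arguments are placeholders, not values from this issue):

// Sketch of the workaround described above; paths/arguments are placeholders.
using System;
using System.Diagnostics;

class Entrypoint
{
    static void Main()
    {
        // The runtime only reads DOTNET_DiagnosticPorts at startup, so clearing it here
        // does not affect monitoring of this already-running entrypoint process.
        Environment.SetEnvironmentVariable("DOTNET_DiagnosticPorts", null);

        // Child processes inherit the current environment block, so they start without
        // DOTNET_DiagnosticPorts and never suspend waiting for a ResumeStartup command.
        var psi = new ProcessStartInfo
        {
            FileName = "dotnet",
            Arguments = "/app/MergeDriver.dll", // placeholder child process
        };
        using var child = Process.Start(psi);
        child?.WaitForExit();
    }
}

If you only want to exclude specific children, removing the variable from ProcessStartInfo.Environment for those launches also works, without changing the entrypoint's own environment.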
Thanks for speedy response @jander-msft! I will give that a go and see how we get on 😄 Let me know if I can do anything to help
In the disposal method it uses the _activeEndpointsSemaphore around a mutation of _activeEndpointServiceScopes, so I think the same needs to be done around the call to .Add in this case? (Disposal snippet below, followed by a rough sketch of what I mean.)
await _activeEndpointsSemaphore.WaitAsync(CancellationToken.None);
try
{
    serviceScopes = new List<AsyncServiceScope>(_activeEndpointServiceScopes.Values);
    _activeEndpointServiceScopes.Clear();
}
finally
{
    _activeEndpointsSemaphore.Release();
}
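So roughly something like this around the Add in ResumeAndQueueEndpointInfo (a sketch only; the key, value, and token names are my assumptions, not the actual code):

// Sketch of the suggested fix: guard the Add with the same semaphore the disposal path uses,
// so the plain Dictionary is never mutated concurrently. Names here are assumptions.
await _activeEndpointsSemaphore.WaitAsync(token);
try
{
    _activeEndpointServiceScopes.Add(info.RuntimeInstanceCookie, serviceScope);
}
finally
{
    _activeEndpointsSemaphore.Release();
}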
Yeah, I determined that to be the issue as well. I'll put up a PR shortly.
As a heads up, this should be fixed in the upcoming .NET Monitor 8.0 servicing release and .NET Monitor 9.0 Preview 4 release.
The fix is now available in the 8.0.2 release. Please check the release notes for any additional changes that may impact your usage of .NET Monitor: https://github.com/dotnet/dotnet-monitor/releases/tag/v8.0.2
Thanks!