Startup hang caused by the deployment of New Relic
Description
At a very high level: once we deploy New Relic (both the infra agent and the dotnet agent), we experience start-up hangs in our dotnet6/dotnet8 apps.
We ran into this problem last year: https://github.com/dotnet/aspnetcore/issues/50317, but we didn't realise New Relic was the root cause until we tried to roll it out again this week (because I saw this fix) and the problem re-appeared.
Specifically, the error logged at the highest level is "ASP.NET Core app failed to start after 120000 milliseconds".
Expected Behavior
New Relic should not cause the app to hang on start-up.
Troubleshooting or NR Diag results
It's very difficult to establish a troubleshooting workflow because the hang is very intermittent and sporadic. It almost feels like the app waits to connect to the NR backend and then times out after exactly 2 minutes.
Steps to Reproduce
Unable to reproduce at all, only happens in production very intermittently/sporadically.
Your Environment
Windows Server 2022 (was 2019) - experienced it on both OS versions.
dotnet8 (was dotnet6) - experienced it on both dotnet versions.
On-prem
Hosted via IIS
NewRelicDotNetAgent_10.30.0_x64.msi (deployed using DSC: ADDLOCAL=NETCoreSupport)
newrelic-infra-amd64.1.57.1.msi (deployed using DSC)
Additional context
I know it's very long and information overload, but I did a ton of work in this ticket: https://github.com/dotnet/aspnetcore/issues/50317
Our newrelic-infra.yml looks like this:-
# required
license_key: REDACTED
# to save cpu; we don't send process metrics, so we don't need the local agent to collect them
metrics_process_sample_rate: -1
# to save cpu; we don't use any of the plugins, so turn them off locally
disable_all_plugins: true
# to save cpu; we don't have removable drives, so turn off detection locally
win_removable_drives: false
# to save cpu; we don't care to monitor the WU state of a host
windows_updates_refresh_sec: -1
# to save cpu; we don't care to monitor the services.msc of a host
windows_services_refresh_sec: -1
Our newrelic.config looks like this:-
<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc. All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
<service licenseKey="REDACTED"/>
<application/>
<log level="off"/>
<allowAllHeaders enabled="true"/>
<attributes enabled="true">
<exclude>request.headers.cookie</exclude>
<exclude>request.headers.authorization</exclude>
<exclude>request.headers.proxy-authorization</exclude>
<exclude>request.headers.x-*</exclude>
<include>request.headers.*</include>
</attributes>
<transactionTracer enabled="false"/>
<distributedTracing enabled="false"/>
<errorCollector enabled="false"/>
<browserMonitoring autoInstrument="false"/>
<threadProfiling>
<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
</threadProfiling>
<applicationLogging enabled="false"/>
<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
<slowSql enabled="false"/>
<distributedTracing enabled="false"/>
<codeLevelMetrics enabled="false"/>
</configuration>
Our appsettings.json looks like this:-
{
"Logging": {
"LogLevel": {
"Default": "Information",
"Log4Net": "Debug",
"Microsoft": "Warning",
"Microsoft.Hosting.Lifetime": "Information"
}
},
"AllowedHosts": "*",
"NewRelic.AppName": "REDACTED",
"NewRelic.AgentEnabled": "true"
}
https://new-relic.atlassian.net/browse/NR-322365
Places I can find that match 120 seconds:-
https://github.com/newrelic/newrelic-dotnet-agent/blob/709ddb67d667a7654dc9e22f7351c16a8b3b7f95/src/Agent/NewRelic/Agent/Core/Commands/ThreadProfilerCommandArgs.cs#L16
https://github.com/newrelic/newrelic-dotnet-agent/blob/709ddb67d667a7654dc9e22f7351c16a8b3b7f95/src/Agent/NewRelic/Agent/Core/DataTransport/ConnectionManager.cs#L33
https://github.com/newrelic/newrelic-dotnet-agent/blob/709ddb67d667a7654dc9e22f7351c16a8b3b7f95/src/Agent/NewRelic/Agent/Core/DataTransport/DataStreamingService.cs#L160
https://github.com/newrelic/newrelic-dotnet-agent/blob/709ddb67d667a7654dc9e22f7351c16a8b3b7f95/src/Agent/NewRelic/Agent/Core/AgentHealth/AgentHealthReporter.cs#L25
I don't know the architecture of the agent (yet), but I guess the most likely candidate is whichever code path is invoked during start-up.
My suspicion initially falls on
https://github.com/newrelic/newrelic-dotnet-agent/blob/709ddb67d667a7654dc9e22f7351c16a8b3b7f95/src/Agent/NewRelic/Agent/Core/DataTransport/ConnectionManager.cs#L79
As we keep the default settings, I think we would fall into the sync path (and therefore block).
Hmmm, https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#service-syncStartup says it defaults to false.
So does that mean it is https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#service-requestTimeout, which implies that it is 120 seconds, as we keep sendDataOnExit at the default value (false)?
Because you are not enabling send data on exit or sync startup, the agent handles its connection logic on a separate thread pool thread via a timer that should be triggered without a delay. https://github.com/newrelic/newrelic-dotnet-agent/blob/709ddb67d667a7654dc9e22f7351c16a8b3b7f95/src/Agent/NewRelic/Agent/Core/DataTransport/ConnectionManager.cs#L82 I don't think any of the other duration settings that you found in the code are related to this problem, because those things, with the exception of the connect step (which happens on a different thread), all require the agent to complete the connect process first.
In order to understand what is going on we will most likely need a memory dump from a time when the application fails to start, and we would also likely need DEBUG or FINEST logs. Because these files can be potentially sensitive, the best way to get support on this issue is to work with our New Relic support team, because they have the ability to receive these types of files without having them shared publicly on GitHub. If you are unable to send these files to New Relic support, you can attempt to analyze the memory dump, but we'll need to know which threads/tasks are blocked by which other threads/tasks.
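For reference, one way to capture such a dump is something like the following (a sketch only; it assumes the dotnet-dump global tool is installed and that you can identify the hung worker process before IIS recycles it - the app pool name here is a placeholder):
# Sketch: find the w3wp.exe worker process for the hung app pool (pool name is a placeholder)
$wp = Get-WmiObject -Namespace 'root\WebAdministration' -Class 'WorkerProcess' |
    Where-Object { $_.AppPoolName -eq 'YourHungAppPool' };
# Capture a full memory dump of that process with dotnet-dump
dotnet-dump collect --process-id $wp.ProcessId --type Full --output "C:\dumps\w3wp_$($wp.ProcessId).dmp";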
Most of the startup hang/deadlock issues that I've seen are related to applications using a Configuration Manager that needs to reach out to an external system. Usually these calls to the external system are something that the agent also instruments. We have a default list of known instrumented methods that have caused this problem. https://github.com/newrelic/newrelic-dotnet-agent/blob/709ddb67d667a7654dc9e22f7351c16a8b3b7f95/src/Agent/NewRelic/Agent/Core/AgentShim.cs#L43-L49 We also allow extending this list using an environment variable. https://github.com/newrelic/newrelic-dotnet-agent/blob/709ddb67d667a7654dc9e22f7351c16a8b3b7f95/src/Agent/NewRelic/Agent/Core/AgentShim.cs#L69 The difficult part will be determining what to add to that list. In the linked aspnetcore issue you provided a sample startup.cs.txt file, and in that file it looks like you might be using a PostgreSQL database with the configuration manager. If that's the case, you may need to add the classes and methods defined in the SQL instrumentation XML associated with the Postgres comments. Here is an example of one of those sections. https://github.com/newrelic/newrelic-dotnet-agent/blob/709ddb67d667a7654dc9e22f7351c16a8b3b7f95/src/Agent/NewRelic/Agent/Extensions/Providers/Wrapper/Sql/Instrumentation.xml#L140-L149
Hi @nrcventura ,
Appreciate the detailed reply.
In order to understand what is going on we will most likely need a memory dump from a time when the application fails to start, and we would also likely need DEBUG or FINEST logs.
I would love to get a memory dump. Unfortunately, this is pretty hard because we only know it's hung after the deploy has failed and the process has been automatically killed off by IIS because it is deemed "unhealthy". I can change pretty much anything, but it's not something I can do quickly as it would affect the entire CI/CD pipeline.
I did however push out the change to the logging late on Friday, and this morning it has happened twice already.
<log level="all" maxLogFileSizeMB="100"/>
Last log:-
[2024-10-07T09:38:00.405Z, PID: 31444] [aspnetcorev2.dll] ASPNET_CORE_GLOBAL_MODULE::Terminate
Timeout log:-
[2024-10-07T09:37:59.247Z, PID: 31444] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489
First log:-
[2024-10-07T09:35:59.161Z, PID: 31444] [aspnetcorev2.dll] Initializing logs for 'C:\Program Files\IIS\Asp.Net Core Module\V2\aspnetcorev2.dll'. Process Id: 31444. File Version: 18.0.24141.6. Description: IIS ASP.NET Core Module V2. Commit: .
Most of the startup hang/deadlock issues that I've seen are related to applications using a Configuration Manager that needs to reach out to an external system.
Thanks, that gives me a thread to pull on. We do store our config keys etc. in PG, and all apps reach out to PG at startup time to get various details, settings, etc.
I'll try adding each method one by one to NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST and see if the failures stop (this may take some time given the intermittent nature of the problem).
I've attached the relevant logs from the most recent failure.
newrelic_agent_Allium.WebApp_099_31444.zip
Cheers, Indy
I've pushed out
[System.Environment]::SetEnvironmentVariable('NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST', 'Npgsql.NpgsqlCommand.ExecuteReader', [System.EnvironmentVariableTarget]::Machine)
To all of our DEMO hosts.
I guess if we don't get another error for a week we can be somewhat confident it might be that.
Hard to prove.
Thanks for the update. Yes, these types of problems are very difficult to reproduce. You might be able to reproduce the problem more reliably if you enable syncStartup, which is one of the settings that you previously found.
I suspect that you may also need the Npgsql.NpgsqlConnection.Open method.
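For reference, enabling syncStartup is a one-attribute change to the service element in newrelic.config (a sketch based on the config docs linked earlier; keep your existing licenseKey):
<!-- sketch: forces the agent to connect synchronously during startup, which can make the timing easier to reproduce -->
<service licenseKey="REDACTED" syncStartup="true"/>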
Just happened again, updated environment variable to:-
[System.Environment]::SetEnvironmentVariable('NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST', 'Npgsql.NpgsqlCommand.ExecuteReader;Npgsql.NpgsqlConnection.Open', [System.EnvironmentVariableTarget]::Machine)
Happened twice this morning, updated environment variable to:-
$methods = @(
"Npgsql.NpgsqlConnection.Open"
"Npgsql.NpgsqlCommand.ExecuteReader"
"Npgsql.NpgsqlDataReader.Read"
"Npgsql.NpgsqlConnection.OpenAsync"
"Npgsql.NpgsqlCommand.ExecuteReaderAsync"
"Npgsql.NpgsqlDataReader.ReadAsync"
);
$concat = $methods -join ';';
[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST", $concat, [System.EnvironmentVariableTarget]::Machine);
I'm not convinced it (the hang on start-up) is caused by fetching the configuration from PG, so I'm including all the methods (that can be used) on that path. This should help us either rule it out or in.
Back to the waiting game!
Cheers, Indy
Tiny update from me:
- I tried syncStartup for local-dev reproduction; got absolutely nowhere.
- I tried to block the domains below using the Windows HOSTS file (collected using ipconfig /displaydns), but the dotnet agent was fine and simply logged retries with some sort of back-off strategy, so that rules out the NR backend <-> dotnet agent communication:-
0.0.0.0 collector-eu01.cell.eu.nr-data.net
0.0.0.0 collector.eu01.nr-data.net
0.0.0.0 fastly-tls12-collector.eu01.nr-data.net
0.0.0.0 fastly-tls12-infra-api.eu.newrelic.com
0.0.0.0 infra-api.cell.eu.nr-data.net
0.0.0.0 infra-api.eu.newrelic.com
0.0.0.0 infrastructure-command-api.eu.newrelic.com
0.0.0.0 newrelic.com
0.0.0.0 tls12.eu.newrelic.com.cdn.cloudflare.net
- I did notice some HTTP 409s in the dotnet agent log; not sure if related/expected? They don't line up with any start-up hangs, but seeing a 409 in the wild is rare enough for me to note it.
- I did notice some 5xx in the infra agent log, but I guess that is to be expected to some extent.
- I peppered the CI/CD deploy logs with Get-Date -Format "o" (see the snippet after this list) and hope to get an exact timeline going. I've got a theory that the presence of the NR agent(s) is enough to cause the start-up time to balloon and it only tips over 120 seconds sometimes.
- I'm basically ignoring the infra agent for now, because the only apps that are affected are the dotnet8 apps (we have some legacy 472 apps) and we only installed the .NET Core support feature (NETCoreSupport).
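The timestamping itself is nothing fancy, just lines like these around each slow step (illustrative only; the variable names are placeholders):
Write-Host "$(Get-Date -Format 'o') starting warmup of $AbsoluteUrl";
Invoke-WebRequest -Uri $AbsoluteUrl -TimeoutSec $Timeout -UseBasicParsing -DisableKeepAlive -MaximumRedirection 0;
Write-Host "$(Get-Date -Format 'o') finished warmup of $AbsoluteUrl";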
Cheers, Indy
If you are not seeing the problem when syncStartup is enabled, then I'm guessing that you are not running into the deadlock problem that we've seen in the past. I opened an issue to add a feature to just disable the usage of the ConfigurationManager altogether so that we can more easily rule that problem in or out before dedicating more time to identify what to add to that delayed initialization list.
I have seen other weird cases in certain environments where enabling agent log levels higher than debug or info has slowed down application startup enough (in those environments) to trigger the default ASP.NET Core app startup timeout. I don't think that this is likely to be the problem that you are experiencing, but https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#log-enabled explains how to disable our logging system (this does not affect capturing and forwarding agent log data).
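If you want to rule that out, per those docs it should just be a matter of flipping the enabled attribute on the existing log element in newrelic.config (sketch):
<!-- sketch: disables the agent's own log files entirely -->
<log enabled="false"/>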
Another situation where we have seen deadlocks occur is within the agent's usage of the .NET Event Pipe, which we use to collect garbage collection metrics for the application. The deadlock that we've seen occurs (only occasionally) when we try to unsubscribe our EventListener from the Event Pipe EventSource. The collection of this data can be disabled, and doing so has resolved some similar problems (usually noticed when the application is trying to end). To accomplish that you can add <add key="NewRelic.EventListenerSamplersEnabled" value="false" /> to the appSettings element in the newrelic.config file (we don't have an environment variable implemented for this). This is the thing that I would try next.
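Concretely, that means an appSettings block inside the configuration element, something like this (a sketch; keep the rest of your existing file as-is):
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
  <appSettings>
    <!-- sketch: turns off the EventListener-based samplers (GC / threading metrics) -->
    <add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
  </appSettings>
  <!-- existing elements (service, log, attributes, ...) unchanged -->
</configuration>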
So, I don't want to jinx anything, but since I added:-
$methods = @(
"Npgsql.NpgsqlConnection.Open"
"Npgsql.NpgsqlCommand.ExecuteReader"
"Npgsql.NpgsqlDataReader.Read"
"Npgsql.NpgsqlConnection.OpenAsync"
"Npgsql.NpgsqlCommand.ExecuteReaderAsync"
"Npgsql.NpgsqlDataReader.ReadAsync"
);
$concat = $methods -join ';';
[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST", $concat, [System.EnvironmentVariableTarget]::Machine);
We've not had a start-up hang. I'm not saying that I think this is solved, but as time goes on my confidence does increase 🤞🏽
Cheers, Indy
Small update: I've not seen the exact same start-up hang, so I'm getting closer to saying this particular issue (getting external configuration from PG) looks resolved. Right now NR is only deployed to our DEMO environment - the real test will be getting it to our LIVE environment (the hang was more frequent when we first attempted the NR deployment last year).
If we get to the end of the week without another start-up hang, I'll spend next week getting onto a LIVE environment and see what happens.
Tangentially: we've had two instances of the same application hanging after the app had booted, while warming up other external services (in this case BigQuery/GetTable). I don't think it is NR, but I don't really believe in coincidences, so I thought I'd mention it here just in case I need to keep the issue open a bit longer.
Cheers, Indy
Thank you for the updates. I'm not aware of any problems with someone using BigQuery, but we also do not currently support out of the box instrumentation for BigQuery, so I do not know what could be the problem there.
Argh!
The start-up hang has happened again.
From event viewer:-
[2024-10-16T06:11:56.536Z, PID: 8672] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489
I have a NewRelic.Profiler.8672.log, but I don't have the rotation-based log for [pid: 8672]. The last 10 lines in the rotation-based log file (newrelic_agent_VehicleData.WebApp_029.log) are:-
2024-10-16 06:09:13,931 NewRelic DEBUG: [pid: 19716, tid: 111] Request(cce3a86f-56b1-4ae7-8a55-6043b9477b3d): Invocation of "analytic_event_data" yielded response : {}
2024-10-16 06:09:13,931 NewRelic FINEST: [pid: 19716, tid: 111] Transaction Event harvest finished.
2024-10-16 06:09:15,720 NewRelic FINEST: [pid: 19716, tid: 96] Error Event harvest starting.
2024-10-16 06:09:15,720 NewRelic FINEST: [pid: 19716, tid: 95] Custom Event harvest starting.
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: Attempting to execute NewRelic.Providers.Wrapper.Sql.OpenConnectionAsyncWrapper found from InstrumentedMethodInfo: Method: Npgsql.NpgsqlConnection, Npgsql, Version=8.0.5.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7:OpenAsync(System.Threading.CancellationToken), RequestedWrapperName: OpenConnectionTracerAsync, IsAsync: False, RequestedMetricName: , RequestedTransactionNamePriority: , StartWebTransaction: False
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: No transaction, skipping method Npgsql.NpgsqlConnection.OpenAsync(System.Threading.CancellationToken)
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: Attempting to execute NewRelic.Providers.Wrapper.Sql.OpenConnectionWrapper found from InstrumentedMethodInfo: Method: Npgsql.NpgsqlConnection, Npgsql, Version=8.0.5.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7:Open(System.Boolean,System.Threading.CancellationToken), RequestedWrapperName: OpenConnectionTracer, IsAsync: False, RequestedMetricName: , RequestedTransactionNamePriority: , StartWebTransaction: False
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: No transaction, skipping method Npgsql.NpgsqlConnection.Open(System.Boolean,System.Threading.CancellationToken)
2024-10-16 06:09:15,918 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: Attempting to execute NewRelic.Providers.Wrapper.Sql.SqlCommandWrapper found from InstrumentedMethodInfo: Method: Npgsql.NpgsqlCommand, Npgsql, Version=8.0.5.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7:ExecuteReader(System.Data.CommandBehavior), RequestedWrapperName: SqlCommandTracer, IsAsync: False, RequestedMetricName: , RequestedTransactionNamePriority: , StartWebTransaction: False
2024-10-16 06:09:15,918 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: No transaction, skipping method Npgsql.NpgsqlCommand.ExecuteReader(System.Data.CommandBehavior)
Now looking into the other avenues of investigation.
Cheers, Indy
The profiler log is only going to have data for the profiler, but we also will want to see the managed agent logs. Managed agent logs can contain information that is potentially sensitive, especially at higher log levels. The best way for us to get these files would be to work with our New Relic support team, since they have the ability to receive these types of files without having them shared publicly on GitHub.
Hi @jaffinito,
Can you clarify what you mean by managed agent logs? I believe I uploaded those over here.
I do review the logs before I upload them. I can open a ticket through our NR account so that we can share logs if something sensitive is in them.
Cheers, Indy
Happened again tonight at
[2024-10-17T18:13:11.391Z, PID: 7804] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489
There is only a profiler log, there aren't any agent logs that pertain to 7804.
Edit* I've not had the opportunity to pursue the other avenues of investigation.
Cheers, Indy
To accomplish that you can add <add key="NewRelic.EventListenerSamplersEnabled" value="false" /> to the appSettings element in the newrelic.config file (we don't have an environment variable implemented for this). This is the thing that I would try next.
This is now done. I've modified "C:\ProgramData\New Relic\.NET Agent\newrelic.config"; it now looks like this:-
<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc. All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
<appSettings>
<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
</appSettings>
<service licenseKey="REDACTED"/>
<application/>
<log level="all" maxLogFileSizeMB="100"/>
<allowAllHeaders enabled="true"/>
<attributes enabled="true">
<exclude>request.headers.cookie</exclude>
<exclude>request.headers.authorization</exclude>
<exclude>request.headers.proxy-authorization</exclude>
<exclude>request.headers.x-*</exclude>
<include>request.headers.*</include>
</attributes>
<transactionTracer enabled="false"/>
<distributedTracing enabled="false"/>
<errorCollector enabled="false"/>
<browserMonitoring autoInstrument="false"/>
<threadProfiling>
<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
</threadProfiling>
<applicationLogging enabled="false"/>
<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
<slowSql enabled="false"/>
<distributedTracing enabled="false"/>
<codeLevelMetrics enabled="false"/>
</configuration>
I confirmed that changing the config did something via the logs:-
2024-10-17 22:46:22,361 NewRelic INFO: [pid: 24308, tid: 48] Reconnecting due to configuration change
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.GCSamplerNetCore has been requested to stop.
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.CpuSampler has been requested to stop.
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.MemorySampler has been requested to stop.
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.ThreadStatsSampler has been requested to stop.
On the topic of environment variables, does NEW_RELIC_DISABLE_SAMPLERS not do the same thing?
Found via grokking the sdk src:-
https://github.com/newrelic/newrelic-dotnet-agent/blob/a77919e5405283fc57ed0251b731c5c1f3047305/src/Agent/NewRelic/Agent/Core/Configuration/DefaultConfiguration.cs#L1666
Cheers, Indy
The DisableSamplers setting disables all of our samplers, including the CPU and memory samplers. The one I recommended disabling only disables a couple of samplers: the EventListener samplers for garbage collection metrics and threading metrics.
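If you do want to rule out all of the samplers at some point, that environment variable can be set the same way as the delay-init list (sketch; the worker processes will likely need to be restarted to pick it up):
# Sketch: disable all agent samplers (CPU, memory, GC, thread stats) machine-wide
[System.Environment]::SetEnvironmentVariable('NEW_RELIC_DISABLE_SAMPLERS', 'true', [System.EnvironmentVariableTarget]::Machine);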
Happened again - I should explain that I've hijacked the CI/CD pipeline on weekday evenings (00 07,19-23 * * TUE-FRI) to basically deploy our entire application tier (we call this a 'Server Recycle') so I can have a constant stream of "deploys" happening.
What's interesting is that a Server Recycle is basically iisreset + hitting the URLs to warm up the apps. The binaries don't change.
Before I added all the PG methods we used to get the same startup hang on actual deploys too (i.e. the binaries have changed and we don't use iisreset; we just kill the respective pid of that w3wp.exe).
I'm beginning to think that we are playing whack-a-mole, and that some combination of our platform code + NR causes a start-up hang some of the time.
I think our current state of affairs is that when we do an iisreset and then warmup the apps we run the risk of the start-up hang.
To prove that I'll take out the iisreset in our Server Recycle and see what happens.
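The no-iisreset variant will probably look something like this (a sketch; it assumes the same WebAdministration module the existing scripts already use):
# Sketch: recycle each app pool individually instead of a full iisreset
Import-Module WebAdministration;
Get-ChildItem IIS:\AppPools `
| Where-Object { $_.Name -notlike '*servicemonitor*' } `
| ForEach-Object {
    Write-Host "Recycling app pool '$($_.Name)'";
    Restart-WebAppPool -Name $_.Name;
};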
Cheers, Indy
I finally replicated it on LOCAL. THANK GOD.
We have a small console app that is invoked by our OMD/nagios setup every 1-5 seconds (I think); it looks like this (I've appended .txt to make it uploadable).
I leave that running in the background using powershell:-
while($true) {Start-Sleep -Seconds 1; .\CWDotNetDiagClient.exe;}
And then I simulate what a deploy/server recycle does with this (this is copied directly from the scripts, a lot of it is many years old):-
$WebApplicationCreatorConfigurationDirectory = "D:\code\CodeweaversWorld\src\WebApplicationCreator\LocalConfiguration";
$WebApplicationCreatorExecutable = "D:\code\CodeweaversWorld\src\WebApplicationCreator\app\IISApplicationExecutor\bin\IISApplicationExecutor.exe";
$Timeout = 480;
$AbsoluteUrl = "http://datamining.localiis/services/application.svc/warmup";
while ($true) {
Get-WmiObject -Namespace 'root\WebAdministration' -class 'WorkerProcess' -Filter "NOT AppPoolName LIKE '%servicemonitor%'" `
| Select-Object -Property AppPoolName, ProcessId `
| ForEach-Object {
Write-Host "Killing '$($_.AppPoolName)' with pid '$($_.ProcessId)'";
Stop-Process -Id $_.ProcessId -Force -ErrorAction SilentlyContinue;
};
& "C:\Windows\System32\iisreset.exe" /stop ;
& "$WebApplicationCreatorExecutable" "ZZZ_PLACEHOLDER" "RESET" "$WebApplicationCreatorConfigurationDirectory" "NOPROMPT";
& "C:\Windows\System32\iisreset.exe" /start ;
Start-Sleep -Seconds 2;
& "$WebApplicationCreatorExecutable" "ZZZ_PLACEHOLDER" "ALL" "$WebApplicationCreatorConfigurationDirectory" "NOPROMPT";
Invoke-WebRequest -Uri $AbsoluteUrl -TimeoutSec $Timeout -UseBasicParsing -DisableKeepAlive -MaximumRedirection 0;
}
The clue, on re-reading the entire issue thread, was:-
The deadlock that we've seen occurs (only occasionally) when we try to unsubscribe our EventListener from the Event Pipe EventSource.
Many thanks @nrcventura for that nugget, it took a while to percolate.
I really should get to bed, maybe tonight I won't dream about this issue.
Cheerio! Indy
Managed to replicate it a 2nd & 3rd time (took 185 tries) on LOCAL; this time I was quick enough to grab a memory dump.
I'll share them on Monday via the NR support ticket route.
Meanwhile I had a quick poke around using dotnet dump:-
> syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner
2 000001E2B298B030 5 1 000001E2B3550430 413c 27 000001a235c153b0 System.Collections.Generic.List`1[[System.WeakReference`1[[System.Diagnostics.Tracing.EventSource, System.Private.CoreLib]], System.Private.CoreLib]]
> setthread 27
> clrstack
OS Thread Id: 0x413c (27)
Child SP IP Call Site
00000014A3AFF438 00007ffec3202f44 [HelperMethodFrame: 00000014a3aff438]
00000014A3AFF540 00007FFE141CC832 System.Threading.ThreadPool.get_CompletedWorkItemCount() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.Windows.cs @ 204]
00000014A3AFF570 00007FFE141CC79D System.Diagnostics.Tracing.RuntimeEventSource+<>c.<OnEventCommand>b__35_9() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs @ 110]
00000014A3AFF5A0 00007FFE141CC511 System.Diagnostics.Tracing.IncrementingPollingCounter.UpdateMetric() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/IncrementingPollingCounter.cs @ 57]
00000014A3AFF600 00007FFE141CC406 System.Diagnostics.Tracing.CounterGroup.ResetCounters() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 206]
00000014A3AFF670 00007FFE141CC0BD System.Diagnostics.Tracing.CounterGroup.EnableTimer(Single) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 158]
00000014A3AFF6D0 00007FFE141C6B18 System.Diagnostics.Tracing.CounterGroup.OnEventSourceCommand(System.Object, System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 73]
00000014A3AFF750 00007FFE1396CD5A System.Diagnostics.Tracing.EventSource.DoCommand(System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2738]
00000014A3AFF7B0 00007FFE1396C61E System.Diagnostics.Tracing.EventSource.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventProviderType, Int32, System.Diagnostics.Tracing.EventCommand, Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2574]
00000014A3AFF830 00007FFE1396C45B System.Diagnostics.Tracing.EventSource+OverrideEventProvider.OnControllerCommand(System.Diagnostics.Tracing.ControllerCommand, System.Collections.Generic.IDictionary`2<System.String,System.String>, Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2438]
00000014A3AFF890 00007FFE139694C9 System.Diagnostics.Tracing.EventPipeEventProvider.HandleEnableNotification(System.Diagnostics.Tracing.EventProvider, Byte*, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 58]
00000014A3AFF8E0 00007FFE13969077 System.Diagnostics.Tracing.EventProviderImpl.ProviderCallback(System.Diagnostics.Tracing.EventProvider, Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1310]
00000014A3AFF940 00007FFE13968E14 System.Diagnostics.Tracing.EventPipeEventProvider.Callback(Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*, Void*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 68]
> clrstack -all
OS Thread Id: 0x26c0
Child SP IP Call Site
00000014A3A7C738 00007ffec31fffa4 [HelperMethodFrame_1OBJ: 00000014a3a7c738] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00000014A3A7C880 00007FFE13902B8E System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, System.String, System.String[]) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 1619]
00000014A3A7C930 00007FFE141C873C System.Data.DataCommonEventSource..ctor()
00000014A3A7C990 00007FFE141C862D System.Data.DataCommonEventSource..cctor() [/_/src/libraries/System.Data.Common/src/System/Data/Common/DataCommonEventSource.cs @ 12]
00000014A3A7D1F8 00007ffe7345c663 [HelperMethodFrame: 00000014a3a7d1f8]
00000014A3A7D300 00007FFE141C85DA System.Data.Common.DbConnectionStringBuilder.Clear() [/_/src/libraries/System.Data.Common/src/System/Data/Common/DbConnectionStringBuilder.cs @ 283]
00000014A3A7D340 00007FFE141BA91E Npgsql.NpgsqlConnectionStringBuilder.Init()
00000014A3A7D380 00007FFE141BA239 Npgsql.NpgsqlConnectionStringBuilder..ctor()
00000014A3A7D3B0 00007FFE141BA026 Npgsql.NpgsqlConnection..cctor()
00000014A3A7DC28 00007ffe7345c663 [HelperMethodFrame: 00000014a3a7dc28]
00000014A3A7DD30 00007FFE141B9FCC Npgsql.NpgsqlConnection..ctor()
00000014A3A7DD70 00007FFE13913900 System.RuntimeType.CreateInstanceDefaultCtor(Boolean, Boolean) [/_/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 3878]
00000014A3A7DDD0 00007FFE141B97DA log4net.Appender.AdoNetAppender.CreateConnection(System.Type, System.String)
00000014A3A7DE10 00007FFE141B8958 log4net.Appender.AdoNetAppender.InitializeDatabaseConnection()
00000014A3A7DE80 00007FFE13963502 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseAppender(System.Xml.XmlElement)
00000014A3A7DF20 00007FFE13962CFF log4net.Repository.Hierarchy.XmlHierarchyConfigurator.FindAppenderByReference(System.Xml.XmlElement)
00000014A3A7DF90 00007FFE13963330 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseAppender(System.Xml.XmlElement)
00000014A3A7E030 00007FFE13962CFF log4net.Repository.Hierarchy.XmlHierarchyConfigurator.FindAppenderByReference(System.Xml.XmlElement)
00000014A3A7E0A0 00007FFE139618E1 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseChildrenOfLoggerElement(System.Xml.XmlElement, log4net.Repository.Hierarchy.Logger, Boolean)
00000014A3A7E160 00007FFE139606A5 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseLogger(System.Xml.XmlElement)
00000014A3A7E1E0 00007FFE1395DC02 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.Configure(System.Xml.XmlElement)
00000014A3A7E260 00007FFE1395D177 log4net.Repository.Hierarchy.Hierarchy.XmlRepositoryConfigure(System.Xml.XmlElement)
00000014A3A7E2D0 00007FFE13953004 log4net.Config.XmlConfigurator.InternalConfigure(log4net.Repository.ILoggerRepository, System.IO.Stream)
00000014A3A7E340 00007FFE13951B9F log4net.Config.XmlConfigurator.InternalConfigure(log4net.Repository.ILoggerRepository, System.IO.FileInfo)
00000014A3A7E3D0 00007FFE139515D3 log4net.Config.XmlConfigurator.InternalConfigureAndWatch(log4net.Repository.ILoggerRepository, System.IO.FileInfo)
00000014A3A7E440 00007FFE13911316 log4net.Config.XmlConfigurator.ConfigureAndWatch(System.IO.FileInfo)
00000014A3A7E4B0 00007FFE1390FEFD Codeweavers.Logging.Logger..cctor()
00000014A3A7ED68 00007ffe7345c663 [HelperMethodFrame: 00000014a3a7ed68]
00000014A3A7EE70 00007FFE1390FCA5 Codeweavers.Logging.Logger.GetLogger(System.String)
00000014A3A7EED0 00007FFE1390F451 DataMining.WebApp.Program.Main(System.String[])
OS Thread Id: 0x3180
Child SP IP Call Site
00000014A427F8F0 00007ffec31fffa4 [DebuggerU2MCatchHandlerFrame: 00000014a427f8f0]
OS Thread Id: 0x367c
Child SP IP Call Site
00000014A52CD8F8 00007ffec31fffa4 [HelperMethodFrame_1OBJ: 00000014a52cd8f8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00000014A52CDA40 00007FFE13902B8E System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, System.String, System.String[]) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 1619]
00000014A52CDAF0 00007FFE1396E434 System.Diagnostics.Tracing.FrameworkEventSource..ctor() [/_/src/coreclr/System.Private.CoreLib/System.Private.CoreLib.Generators/Generators.EventSourceGenerator/FrameworkEventSource.g.cs @ 12]
00000014A52CDB40 00007FFE1396E34D System.Diagnostics.Tracing.FrameworkEventSource..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/FrameworkEventSource.cs @ 14]
00000014A52CE3A8 00007ffe7345c663 [HelperMethodFrame: 00000014a52ce3a8]
00000014A52CE4B0 00007FFE1396DFC5 System.Threading.ThreadPoolWorkQueue..ctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 442]
00000014A52CE570 00007FFE1396DB01 System.Threading.ThreadPool..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 1430]
00000014A52CEDE8 00007ffe7345c663 [HelperMethodFrame: 00000014a52cede8]
00000014A52CEEF0 00007FFE1396D940 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib]].AwaitUnsafeOnCompleted[[System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable`1+ConfiguredValueTaskAwaiter[[System.Boolean, System.Private.CoreLib]], System.Private.CoreLib]](ConfiguredValueTaskAwaiter<Boolean> ByRef, System.Runtime.CompilerServices.IAsyncStateMachineBox)
00000014A52CEF50 00007FFE1396B338 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender+<ForwardLoggingEventsFromBuffer>d__25.MoveNext()
00000014A52CF130 00007FFE13968EF7 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 38]
00000014A52CF1A0 00007FFE139695A8 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender.ForwardLoggingEventsFromBuffer()
00000014A52CF210 00007FFE13969199 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender+<ForwardingThreadExecute>d__24.MoveNext()
00000014A52CF300 00007FFE13968EF7 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 38]
00000014A52CF370 00007FFE13968C38 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender.ForwardingThreadExecute()
00000014A52CF3E0 00007FFE13968AED System.Threading.Tasks.Task`1[[System.__Canon, System.Private.CoreLib]].InnerInvoke() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Future.cs @ 501]
00000014A52CF420 00007FFE13968990 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 179]
00000014A52CF490 00007FFE13968655 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2345]
00000014A52CF780 00007ffe7345c663 [DebuggerU2MCatchHandlerFrame: 00000014a52cf780]
OS Thread Id: 0x413c
Child SP IP Call Site
00000014A3AFF438 00007ffec3202f44 [HelperMethodFrame: 00000014a3aff438]
00000014A3AFF540 00007FFE141CC832 System.Threading.ThreadPool.get_CompletedWorkItemCount() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.Windows.cs @ 204]
00000014A3AFF570 00007FFE141CC79D System.Diagnostics.Tracing.RuntimeEventSource+<>c.<OnEventCommand>b__35_9() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs @ 110]
00000014A3AFF5A0 00007FFE141CC511 System.Diagnostics.Tracing.IncrementingPollingCounter.UpdateMetric() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/IncrementingPollingCounter.cs @ 57]
00000014A3AFF600 00007FFE141CC406 System.Diagnostics.Tracing.CounterGroup.ResetCounters() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 206]
00000014A3AFF670 00007FFE141CC0BD System.Diagnostics.Tracing.CounterGroup.EnableTimer(Single) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 158]
00000014A3AFF6D0 00007FFE141C6B18 System.Diagnostics.Tracing.CounterGroup.OnEventSourceCommand(System.Object, System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 73]
00000014A3AFF750 00007FFE1396CD5A System.Diagnostics.Tracing.EventSource.DoCommand(System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2738]
00000014A3AFF7B0 00007FFE1396C61E System.Diagnostics.Tracing.EventSource.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventProviderType, Int32, System.Diagnostics.Tracing.EventCommand, Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2574]
00000014A3AFF830 00007FFE1396C45B System.Diagnostics.Tracing.EventSource+OverrideEventProvider.OnControllerCommand(System.Diagnostics.Tracing.ControllerCommand, System.Collections.Generic.IDictionary`2<System.String,System.String>, Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2438]
00000014A3AFF890 00007FFE139694C9 System.Diagnostics.Tracing.EventPipeEventProvider.HandleEnableNotification(System.Diagnostics.Tracing.EventProvider, Byte*, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 58]
00000014A3AFF8E0 00007FFE13969077 System.Diagnostics.Tracing.EventProviderImpl.ProviderCallback(System.Diagnostics.Tracing.EventProvider, Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1310]
00000014A3AFF940 00007FFE13968E14 System.Diagnostics.Tracing.EventPipeEventProvider.Callback(Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*, Void*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 68]
A quick Google around leads me to https://github.com/dotnet/runtime/issues/105682 https://github.com/dotnet/runtime/issues/104322
So it looks like connecting to the event pipe just as the app is initialising can cause a hang.
What is weird is why the deployment of NR causes the start-up hang for us.
That console app we've been using has been in place since 2023-03, and we've never seen the hang without the NR agent(s) being deployed.
Cheers, Indy
A lot of the code in the DiagnosticSource library tries to do as little work as possible if there are no subscribers to a particular source (DiagnosticSource, EventPipe, ActivitySource, Meter, etc.). The New Relic agent enables an EventListener by default to subscribe to the EventPipe data source in order to capture garbage collection metrics (this was the primary way to capture that type of data for .NET Core < 3.0, which is when we started collecting that data). When .NET Standard 2.1 came out and .NET Core 3.1 was released, more APIs became available for collecting garbage collector data, but the agent still needed to support older applications. Similarly, .NET 6 has a better way to collect this data, but those APIs are not readily available unless the agent can stop targeting .NET Standard 2.0 and switch over to targeting .NET 6 or .NET 8 directly (which would be a breaking change requiring a major release). In the meantime, I've seen the EventPipe data source (when subscribed to from within an application instead of out-of-process) have its stability increase and decrease with different .NET versions. At this point, we do have plans to migrate away from the EventPipe for collecting garbage collection data, but it will result in a breaking change because the data we will be able to capture will be different. We also would like to align the data that we capture with OpenTelemetry, which is what .NET is increasingly aligning with in general.
Hopefully, the problem you are experiencing goes away when the event listener samplers are disabled.
Sorry for the delay in sharing the memory dump - we are part of a much larger organisation and it turns out that our child accounts (4721360, 4721361) in NR don't have "case message support". Trying to get that sorted out. This may take a while 😬
Hopefully, the problem you are experiencing goes away when the event listener samplers are disabled.
Thanks, I'll try DisableSamplers next.
We also would like to align the data that we capture with OpenTelemetry which is what .net is increasing its alignment with in general.
Ah, we were wondering why dotnet counters weren't being collected - so this is on the roadmap? This may block the NR rollout; we need dotnet counters support.
Cheers, Indy
Ah, we were wondering why dotnet counters weren't being collected - so this is on the roadmap? This may block the NR rollout; we need dotnet counters support.
dotnet counters is a dotnet tool that was released around the same time that the EventCounter API was added to the DiagnosticSource library; it allows out-of-process collection of metric telemetry data. Previously the main way to get this telemetry information was via performance counters, which is a Windows-only API. The EventCounter API still exists, but a new Meter API was added, and that is the new OpenTelemetry-compatible API that is recommended. Both the EventCounter and Meter APIs are compatible with dotnet counters and dotnet monitor (I believe). From the New Relic agent perspective, we wanted to focus less on enabling the capture of arbitrary sources of metric data (at least initially), and wanted to instead focus on a curated set of sources and metrics that we have found to be the most helpful and that are consistently recommended in Microsoft and 3rd-party troubleshooting guidance. The New Relic agent also collected these metrics before they became standardized. Now that they are more standardized, and some of them will be available out of the box with .NET 9, it is a good time for us to also standardize on that data (but these migrations can be difficult for our customers). There is a lot of overlap (of the most commonly referenced metrics) between the metrics that the New Relic agent automatically collects and the data available in the dotnet counters tool.
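For reference, out-of-process collection with the dotnet-counters tool looks roughly like this (illustrative only; it assumes the dotnet-counters global tool is installed and the pid is a placeholder):
# Sketch: stream the built-in System.Runtime counters (GC, ThreadPool, exceptions, etc.) from a running process
dotnet-counters monitor --process-id 12345 --counters System.Runtime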
In order to understand your question about dotnet counters support we need a little more information.
- Are you generating custom metrics with the EventCounter or Meter APIs? If so, which APIs and metric types are you using?
- Are there specific metrics that you are hoping to have collected for every application? Will the list of metrics vary per application?
- If the agent collected a standard set of metrics out of the box, but allowed the ability to configure the collection of additional Meters, would that be helpful (this is basically what the OpenTelemetry Automatic Instrumentation project supports)?
Previously the main way to get this telemetry information was via performance counters, which was a Windows-only API.
Yeah, that was a painful deprecation for us. We did try and complain about it over here.
Are you generating custom metrics with the EventCounter or Meter APIs? If so, which APIs and metric types are you using?
Currently we have around 30 custom Meters; we don't have any EventCounter(s) yet. For Meters, we use CreateCounter, CreateHistogram, and CreateObservableGauge (this is from a quick grep, so not exhaustive).
Are there specific Metrics that you are hoping to have collected for every application? Will the list of metrics vary per application?
We would expect the standard set emitted by the Core CLR at the very least (and these).
If the agent collected a standard set of metrics out of the box, but allowed the ability to configure the collection of additional Meters would that be helpful (this is basically what OpenTelemetry Automatic Instrumentation project supports)?
Very much so, that is exactly what we want/need.
The backstory here is that we tried NR a while back and it didn't go well (because of this issue), so we had to fall back to otel/grafana, which has proved troublesome for unrelated reasons. Our parent company is huge and has an Enterprise-level contract with NR, so we decided to give NR another go; that's when I realised it was the NR deployment causing the start-up hang.
We very much want to align with the parent company, but missing the counters from the Core CLR would very much delay the rollout until those are supported/collected by the NR agent.
Hopefully that makes sense, I'm battling a cold 🤧
Cheers, Indy
Thank you for this feedback. This confirms my assumptions about the type of support that the New Relic agent should have.
Quick (bad) update: I tried with NEW_RELIC_DISABLE_SAMPLERS=true and I still get the start-up hang. Took 130 tries. I didn't get a memory dump this time - I just left it running and checked it periodically.
I know there were other suggestions in this thread; I'll go through them later this week/next week.
Cheers, Indy
Tried with <application disableSamplers="true"/> in "C:\ProgramData\New Relic\.NET Agent\newrelic.config" and I still got the start-up hang. This time after 109 tries.
The quest continues :)
Edit* I did check but we still don't have access to case message support; I'll update as soon as I do.
Cheers, Indy