powershell-universal
Service randomly crashing - System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool.
Version
4.4.0
Severity
Critical
Environment
msi
Steps to Reproduce
I have noticed that an instance of PSU I'm hosting will randomly crash with the following error in the event logs:
Application: Universal.Server.exe
CoreCLR Version: 7.0.222.60605
.NET Version: 7.0.2
Description: The process was terminated due to an unhandled exception.
Exception Info: System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
and this in the system logs:
2024-10-01 03:23:15.829 -05:00 [INF] Groom date is: 9/1/2024 8:23:15 AM
2024-10-01 03:23:17.240 -05:00 [INF] Finished groom job.
2024-10-01 03:24:12.844 -05:00 [INF] Starting heartbeat job.
2024-10-01 03:24:26.640 -05:00 [ERR] Execution Worker is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04
System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
at Microsoft.Data.SqlClient.SqlConnection.Open(SqlConnectionOverrides overrides)
at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](DbConnection dedicatedConnection, Func`2 func)
at Hangfire.SqlServer.SqlServerJobQueue.DequeueUsingSlidingInvisibilityTimeout(String[] queues, CancellationToken cancellationToken)
at Hangfire.SqlServer.SqlServerJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)
Expected behavior
no crash
Actual behavior
Service is crashing
Additional Environment data
Using the MSI install with SQL hosted in Azure.
Screenshots/Animations
No response
When the service is running, can you check the state of Hangfire? If you go to localhost:5000/hangfire, I'm curious about the number of queued jobs. I've seen similar errors, albeit without a crash, when there were thousands or millions of jobs queued in Hangfire and it couldn't process them fast enough.
Among the enqueued jobs, is there a specific type that is queued (heartbeat, groom, etc.)? Is it on a queue that belongs to an online machine?
One quick way to work around the situation is to truncate the Hangfire.Job table: https://support.ironmansoftware.com/portal/en/kb/articles/kb0077-startup-failure-of-powershell-universal-server-in-multi-node-sql-environment
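Before truncating anything, a per-queue count shows where the jobs are piling up. The sketch below is an illustration only, not PSU's or Hangfire's own tooling. It assumes the queued jobs live in a table named JobQueue with a Queue column (my understanding of Hangfire's default SQL Server schema, where the table is usually addressed as [HangFire].[JobQueue]). To stay runnable it uses an in-memory SQLite table with made-up queue names as a stand-in for the real database.

```python
import sqlite3


def queue_depths(conn):
    """Return {queue_name: enqueued_job_count}, busiest queue first."""
    rows = conn.execute(
        "SELECT Queue, COUNT(*) AS Jobs "
        "FROM JobQueue GROUP BY Queue ORDER BY Jobs DESC, Queue"
    ).fetchall()
    return dict(rows)


def demo_connection():
    """Build a stand-in JobQueue table; names and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE JobQueue ("
        "Id INTEGER PRIMARY KEY, JobId INTEGER, Queue TEXT, FetchedAt TEXT)"
    )
    conn.executemany(
        "INSERT INTO JobQueue (JobId, Queue) VALUES (?, ?)",
        [(1, "old-node"), (2, "old-node"), (3, "test-box")],
    )
    return conn


if __name__ == "__main__":
    for queue, jobs in queue_depths(demo_connection()).items():
        print(f"{queue}: {jobs}")
```

Against the real database, the same GROUP BY query could be run from any SQL client. The schema and table names should be checked against the actual instance first.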
Skimming the jobs, they all look to be related to ExecutionService.Execute.
There are two queues.
The top one is a machine that is technically online but no longer uses this SQL database. We had to revert it when we ran into another, unrelated issue.
The bottom one is another computer that I must have been doing some testing on at some point.
I am not seeing the queue for localhost. However, in PSU, I see the correct computer as the only computer.
Could jobs be queueing up behind the scenes on a "computer/queue" that doesn't exist anymore in the SQL instance?
Running the Hangfire cleanup on the database now. @adamdriscoll
Shouldn't the queues be tied to a computer/node in PSU? If you remove a computer, shouldn't that queue go away?
It should be. I'll leave this issue open to see if we can figure out why that isn't happening.
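While that is being investigated, the leftover queues can be spotted by comparing the queue names against the computers PSU currently lists. This is a minimal sketch under the assumption discussed above, namely that each queue is named after a computer/node. The function name, queue names, and counts are hypothetical, and the comparison is case-insensitive as a guess that queue names may be stored lowercased.

```python
def orphaned_queues(depths, active_nodes):
    """Return the queues (with job counts) that match no active node.

    depths:       {queue_name: enqueued_job_count}
    active_nodes: computer names currently registered (assumed to map
                  one-to-one onto queue names, ignoring case)
    """
    active = {name.lower() for name in active_nodes}
    return {
        queue: count
        for queue, count in depths.items()
        if queue.lower() not in active
    }


if __name__ == "__main__":
    # Hypothetical numbers: two stale queues and one live node.
    depths = {"old-node": 1200, "test-box": 35, "psu-prod": 4}
    print(orphaned_queues(depths, ["PSU-PROD"]))
```

Any queue this reports has jobs but no registered node to dequeue them, which is the situation described earlier in the thread.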
Duplicate of #3911