powershell-universal Service randomly crashing - System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool.

Version

4.4.0

Severity

Critical

Environment

msi

Steps to Reproduce

I have noticed that an instance of PSU I'm hosting randomly crashes with the following error in the event logs:

Application: Universal.Server.exe
CoreCLR Version: 7.0.222.60605
.NET Version: 7.0.2
Description: The process was terminated due to an unhandled exception.
Exception Info: System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)

and this in the system logs:

2024-10-01 03:23:15.829 -05:00 [INF] Groom date is: 9/1/2024 8:23:15 AM
2024-10-01 03:23:17.240 -05:00 [INF] Finished groom job.
2024-10-01 03:24:12.844 -05:00 [INF] Starting heartbeat job.
2024-10-01 03:24:26.640 -05:00 [ERR] Execution Worker is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04
System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
   at Microsoft.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at Microsoft.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at Microsoft.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry, SqlConnectionOverrides overrides)
   at Microsoft.Data.SqlClient.SqlConnection.Open(SqlConnectionOverrides overrides)
   at Hangfire.SqlServer.SqlServerStorage.CreateAndOpenConnection()
   at Hangfire.SqlServer.SqlServerStorage.UseConnection[T](DbConnection dedicatedConnection, Func`2 func)
   at Hangfire.SqlServer.SqlServerJobQueue.DequeueUsingSlidingInvisibilityTimeout(String[] queues, CancellationToken cancellationToken)
   at Hangfire.SqlServer.SqlServerJobQueue.Dequeue(String[] queues, CancellationToken cancellationToken)
   at Hangfire.Server.Worker.Execute(BackgroundProcessContext context)
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)
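
For readers unfamiliar with this error, the sketch below illustrates the failure mode the exception message describes: every connection in a fixed-size pool is checked out, so the next caller waits until its timeout elapses. It is a toy model in Python using only the standard library. `TinyPool`, `PoolTimeout`, `acquire`, and `release` are hypothetical names for illustration, not part of Microsoft.Data.SqlClient or Hangfire, and the pool size and timeout are arbitrary.

```python
import queue


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the timeout."""


class TinyPool:
    """Hypothetical fixed-size pool; real drivers are far more involved."""

    def __init__(self, max_size):
        self._free = queue.Queue()
        for i in range(max_size):
            self._free.put(f"conn-{i}")

    def acquire(self, timeout):
        # Block until a connection is free, or give up after `timeout` seconds.
        try:
            return self._free.get(timeout=timeout)
        except queue.Empty:
            raise PoolTimeout(
                "timeout expired before obtaining a connection from the pool"
            ) from None

    def release(self, conn):
        # Returning a connection makes it available to the next caller.
        self._free.put(conn)


pool = TinyPool(max_size=2)
a = pool.acquire(timeout=0.1)
b = pool.acquire(timeout=0.1)

# Both connections are in use, so a third caller times out.
try:
    pool.acquire(timeout=0.1)
    exhausted = False
except PoolTimeout:
    exhausted = True

# Once a connection is released, acquiring succeeds again.
pool.release(a)
c = pool.acquire(timeout=0.1)
```

In this model the timeout clears as soon as any holder releases a connection, which is why a backlog of work that keeps every connection busy would show up as this exception rather than as a hard connection failure.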

Expected behavior

no crash

Actual behavior

Service is crashing

Additional Environment data

Using the MSI install with SQL hosted in Azure.

Screenshots/Animations

No response

Oct 01 '24 13:10 mikedhanson

When the service is running, can you check the state of Hangfire? If you go to localhost:5000/hangfire, I'm curious about the number of queued jobs. I've seen similar errors, albeit not crashing, happen when there were thousands or millions of jobs queued in Hangfire and it couldn't process them fast enough.

Oct 01 '24 13:10 adamdriscoll

Oct 01 '24 14:10 mikedhanson

In the enqueued jobs, is there a specific type that is queued (heartbeat, groom, etc.)? Is it on a queue of an online machine?

One quick way to work around the situation is to truncate the Hangfire.Job table: https://support.ironmansoftware.com/portal/en/kb/articles/kb0077-startup-failure-of-powershell-universal-server-in-multi-node-sql-environment

Oct 01 '24 14:10 adamdriscoll

Skimming the jobs, they all look to be related to ExecutionService.Execute.

There are two queues.

The top one is a machine that is technically online but not on this SQL DB anymore. We had to revert when we ran into another unrelated issue.

The bottom one is another computer I must have been doing some testing on at some point.

I am not seeing the queue for localhost. However, in PSU, I see the correct computer as the only computer.

Oct 01 '24 14:10 mikedhanson

Could jobs be queueing up behind the scenes on a "computer/queue" that doesn't exist anymore in the SQL instance?

Oct 01 '24 14:10 mikedhanson

Running the Hangfire cleanup on the DB now. @adamdriscoll

Shouldn't the queues be tied to a computer/node in PSU? If you remove a computer, shouldn't that queue go away?

Oct 01 '24 14:10 mikedhanson
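
The question above amounts to a set difference: queues that still hold enqueued jobs, minus the nodes PSU currently knows about. A minimal Python sketch of that check follows. It assumes you have already collected the queue name of each enqueued job and the list of registered computers by whatever means you prefer. The function name `orphaned_queues`, the sample names, and the case-insensitive comparison are all assumptions for illustration, not PSU or Hangfire behavior.

```python
from collections import Counter


def orphaned_queues(enqueued, registered_nodes):
    """Return {queue_name: job_count} for queues that no registered node serves.

    `enqueued` holds one queue name per enqueued job.
    Names are compared case-insensitively (an assumption).
    """
    counts = Counter(enqueued)
    known = {name.lower() for name in registered_nodes}
    return {q: n for q, n in counts.items() if q.lower() not in known}


# Made-up sample data: two stale queues and one queue for the live node.
enqueued = ["old-node"] * 3 + ["test-box"] * 2 + ["current-node"]
registered = ["CURRENT-NODE"]

result = orphaned_queues(enqueued, registered)
```

With this sample data, `result` contains the two stale queues and their job counts. Any queue it reports has jobs waiting on a node that no longer exists in the instance.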

It should be. I'll leave this issue open to see if we can figure out why that isn't happening.

Oct 01 '24 16:10 adamdriscoll

Duplicate of #3911

Nov 01 '24 16:11 adamdriscoll
