SQL Server Deadlocks on simultanous start of multiple cluster nodes
Describe the bug I don't know if that's a bug, a known limitation or a simple configuration error, but everytime our application(s) are (re)started simultanously, there is a chance for a Deadlock on the SQL Server Job Store and therefore an Exception crashing the victim instance.
Our setup is:
- 2 VMs hosting the same ASP.NET Core 9 Application with some Quartz Jobs via IIS (behind a load balancer, but that wouldn't matter anyways)
- 1 SQL Server that also acts as the JobStore for Quartz and the applications business state/domain object/fancy new name for the same concept.
- Quartz is configured to work in a cluster of those 2 applications and to override Job Data to update changed triggers/schedules/...
And that's it. And since that doesn't look too exotic an architecture, that I suspect I'm overseeing something. I tried to track down the error, even in the code base, and that's what I could find out
- There is an event where multiple (both, in this case) applications using the same SQL Server Job store are (re)started at the same time. It sounds unlikely, but e.g. IIS could run scheduled recycles/restarts at fixed times with synchronized clocks
- Those applications host Quartz with the Hosting integration and are started the same way (see https://github.com/quartznet/quartznet/blob/main/src/Quartz/Hosting/QuartzHostedService.cs)
- Quartz tries to match jobs, triggers and schedules with the ones in the Job store within the call to
GetScheduler- When that happens with even a slight delay of a few hundred milliseconds, everything seems to be fine
- If the two application instances start with less delay, we see this Deadlock Exception raised by the SQL Client terminating the victim application instance:
Quartz.JobPersistenceException: Couldn't retrieve trigger: <German Version of the Generic SQL Server Deadlock Message>
---> Microsoft.Data.SqlClient.SqlException (0x80131904): <German Version of the Generic SQL Server Deadlock Message>
at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at Microsoft.Data.SqlClient.SqlDataReader.TryHasMoreRows(Boolean& moreRows)
at Microsoft.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
at Microsoft.Data.SqlClient.SqlDataReader.ReadAsyncExecute(Task task, Object state)
at Microsoft.Data.SqlClient.SqlDataReader.InvokeAsyncCall[T](SqlDataReaderBaseAsyncCallContext`1 context)
--- End of stack trace from previous location ---
at Quartz.Impl.AdoJobStore.StdAdoDelegate.SelectTrigger(ConnectionAndTransactionHolder conn, TriggerKey triggerKey, CancellationToken cancellationToken)
at Quartz.Impl.AdoJobStore.JobStoreSupport.RetrieveTrigger(ConnectionAndTransactionHolder conn, TriggerKey triggerKey, CancellationToken cancellationToken)
ClientConnectionId:<a GUID>
Error Number:1205,State:51,Class:13
--- End of inner exception stack trace ---
at Quartz.Impl.AdoJobStore.JobStoreSupport.RetrieveTrigger(ConnectionAndTransactionHolder conn, TriggerKey triggerKey, CancellationToken cancellationToken)
at Quartz.Impl.AdoJobStore.JobStoreSupport.ExecuteInNonManagedTXLock[T](String lockName, Func`2 txCallback, Func`3 txValidator, Nullable`1 requestorId, CancellationToken cancellationToken)
at Quartz.Impl.AdoJobStore.JobStoreSupport.ExecuteInNonManagedTXLock[T](String lockName, Func`2 txCallback, Func`3 txValidator, Nullable`1 requestorId, CancellationToken cancellationToken)
at Quartz.Core.QuartzScheduler.GetTrigger(TriggerKey triggerKey, CancellationToken cancellationToken)
at Quartz.Core.QuartzScheduler.RescheduleJob(TriggerKey triggerKey, ITrigger newTrigger, CancellationToken cancellationToken)
at Quartz.Xml.XMLSchedulingDataProcessor.ScheduleJobs(IScheduler sched, CancellationToken cancellationToken)
at Quartz.ServiceCollectionSchedulerFactory.InitializeScheduler(IScheduler scheduler, CancellationToken cancellationToken)
at Quartz.ServiceCollectionSchedulerFactory.GetScheduler(CancellationToken cancellationToken)
at Quartz.QuartzHostedService.StartAsync(CancellationToken cancellationToken)
This is the configuration we're running within an ASP.NET Core 9 Host:
builder.Services.Configure<QuartzOptions>(quartzOptions =>
{
quartzOptions.SchedulerId = "AUTO";
quartzOptions.SchedulerName = builder.Environment.ApplicationName;
quartzOptions.Scheduling.IgnoreDuplicates = true;
quartzOptions.Scheduling.OverWriteExistingData = true;
});
builder.Services.AddQuartz(quartz =>
{
quartz.UsePersistentStore<DeadlockAwareJobStore>(store =>
{
store.UseSqlServer(<connectionString>);
store.PerformSchemaValidation = false;
store.UseProperties = true;
store.UseSystemTextJsonSerializer();
store.RetryInterval = TimeSpan.FromSeconds(15);
store.UseClustering(cluster =>
{
cluster.CheckinMisfireThreshold = TimeSpan.FromSeconds(20);
cluster.CheckinInterval = TimeSpan.FromSeconds(10);
});
});
}).AddQuartzHostedService(q => q.WaitForJobsToComplete = true);
Version used 3.14.0
To Reproduce I mean, the problem is, that If you could reproduce Deadlocks, you would have found the reason already anyways. Since starting multiple cluster nodes simultanously without any primary/secondary - hierarchy between them, concurrent access and even Deadlocks sound like they are part of the game? I tried to provide a repro that focuses on how Quartz handles Deadlocks, but I hope that still helps:
//Would need a configured database on the default MSSQLLocalDB instance called "quartznet-deadlock-repro"
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Quartz;
using Quartz.Impl.AdoJobStore;
using Quartz.Spi;
using System.Reflection;
var builder = Host.CreateDefaultBuilder(args);
builder.ConfigureServices(services =>
{
services.Configure<QuartzOptions>(quartzOptions =>
{
quartzOptions.SchedulerId = "AUTO";
quartzOptions.Scheduling.IgnoreDuplicates = true;
quartzOptions.Scheduling.OverWriteExistingData = true;
});
services.AddQuartz(quartz =>
{
quartz.UsePersistentStore(store =>
{
store.UseBinarySerializer();
store.PerformSchemaValidation = false;
store.UseSqlServer("Server=(localdb)\\MSSQLLocalDB;Initial Catalog=quartznet-deadlock-repro;Integrated Security=true");
store.SetProperty("quartz.jobStore.driverDelegateType", typeof(ThrowingSqlServerDelegate).AssemblyQualifiedName!);
store.UseClustering(cluster =>
{
cluster.CheckinMisfireThreshold = TimeSpan.FromSeconds(20);
cluster.CheckinInterval = TimeSpan.FromSeconds(10);
});
});
quartz.ScheduleJob<AJobDoingNothing>(t => t.WithSimpleSchedule(s => s.WithIntervalInSeconds(1).RepeatForever()).StartNow());
}).AddQuartzHostedService();
});
await builder.RunConsoleAsync();
public class AJobDoingNothing : IJob
{
public Task Execute(IJobExecutionContext context) => Task.CompletedTask;
}
public class ThrowingSqlServerDelegate : SqlServerDelegate
{
public override Task<IOperableTrigger?> SelectTrigger(ConnectionAndTransactionHolder conn, TriggerKey triggerKey, CancellationToken cancellationToken = default)
{
var sqlException = CreateSqlException(1205, "Die Transaktion (Prozess-ID 55) befand sich auf Sperre Ressourcen aufgrund eines anderen Prozesses in einer Deadlocksituation und wurde als Deadlockopfer ausgewählt. Führen Sie die Transaktion erneut aus.");
var jobPersistenceException = new JobPersistenceException("Couldn't retrieve trigger: " + sqlException.Message, sqlException);
throw jobPersistenceException;
}
//Thanks to https://stackoverflow.com/questions/1386962/how-to-throw-a-sqlexception-when-needed-for-mocking-and-unit-testing
private static SqlException CreateSqlException(int number, string message = "test")
{
Exception? innerEx = null;
var c = typeof(SqlErrorCollection).GetConstructors(BindingFlags.NonPublic | BindingFlags.Instance);
SqlErrorCollection errors = (c[0].Invoke(null) as SqlErrorCollection)!;
var errorList = (errors.GetType().GetField("_errors", BindingFlags.Instance | BindingFlags.NonPublic)?.GetValue(errors) as List<object>)!;
c = typeof(SqlError).GetConstructors(BindingFlags.NonPublic | BindingFlags.Instance);
var nineC = c.FirstOrDefault(f => f.GetParameters().Length == 9)!;
SqlError sqlError = (nineC.Invoke([number, (byte)0, (byte)0, "", "", "", (int)0, (uint)0, innerEx]) as SqlError)!;
errorList.Add(sqlError);
SqlException ex = (Activator.CreateInstance(typeof(SqlException), BindingFlags.NonPublic | BindingFlags.Instance, null, new object?[] { message, errors,
innerEx, Guid.NewGuid() }, null) as SqlException)!;
return ex;
}
}
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>net9.0</TargetFramework>
<RootNamespace>quartz_deadlock_repro</RootNamespace>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Microsoft.Data.SqlClient" Version="6.0.2" />
<PackageReference Include="Microsoft.Extensions.Hosting" Version="9.0.6" />
<PackageReference Include="Quartz.Extensions.Hosting" Version="3.14.0" />
</ItemGroup>
</Project>
Expected behavior I culd see multiple outcomes of resolving it:
- I am completely overseeing an obvious configuration error resolving that, then please accept my deepest apologies
- Using an random Delay within application startup e.g. by using
Task.Delay(Random.Next.., to have a high chance for the cluster nodes not starting at the same time - There is a way of teaching Quartz to handle Deadlocks. E.g. I would test a custom implementation of
JobStoreTXthat simply re-executes a transaction that was affected by a deadlock:
public class DeadlockAwareJobStore : JobStoreTX
{
protected override async Task<T> ExecuteInLock<T>(
string? lockName,
Func<ConnectionAndTransactionHolder, Task<T>> txCallback,
CancellationToken cancellationToken = default)
{
try
{
return await base.ExecuteInLock(lockName, txCallback, cancellationToken);
}
catch (SchedulerException ex) when (IndicatesDeadlock(ex))
{
await Task.Delay(DbRetryInterval, cancellationToken).ConfigureAwait(false);
return await ExecuteInLock(lockName, txCallback, cancellationToken);
}
}
private static bool IndicatesDeadlock(Exception? ex)
{
if (ex is null)
return false;
if (ex is SqlException { Number: 1205 })
return true;
return IndicatesDeadlock(ex.InnerException);
}
}