Hangfire.PostgreSql icon indicating copy to clipboard operation
Hangfire.PostgreSql copied to clipboard

Deadlock when using batches

Open mschuurmans opened this issue 5 years ago • 11 comments

Hangfire.Core: v1.7.8 Hangfire.PostgreSql: v1.6.3 Hangfire.Pro: v2.2.1 Dotnet Core: v2.2

Possible deadlock in the postgresql storage connector.

I don't see anything when using Trace logging for Hangfire.I think there is something wrong with the locking but don't know how to debug this. To test this i made a empty project with just hangfire and those 2 jobs.

The first enqueue is enqueued by BackgroundJob:

BackgroundJob.Enqueue(() => StartCache.Execute(null));

The StartCache job then inserts a batch job:

BatchJob.StartNew(x => {
    foreach(var id in customerIds) {
        x.Enqueue(() => CacheCustomer.Execute(id, null));
    }
});

The Batchjob is inserted en the jobs are enqueued. It says they are fetched but nothing happens: image

On the batches page it says that it's processing 1 job but nothing is happening: image

Let me know if you require any information or have some pointers on how to find what the issue is here.

mschuurmans avatar Jan 10 '20 07:01 mschuurmans

The lock table in the database shows the following (screenshot is taken at 7:45 UTC)

image

mschuurmans avatar Jan 10 '20 07:01 mschuurmans

Is this still a problem here? Since you closed it on the Hangfire project itself?

frankhommers avatar Apr 10 '20 00:04 frankhommers

It still is an issue. I closed it on the Hangfire project itself because the issue itself is not a "problem" with the Hangfire project. Currently Hangfire pro is only supported on SqlServer and Redis.pro.

I had some contact with Sergey and he told me that it's the way locking works in de Hangfire.Postgresql that is causing deadlocks. But it will not be an easy task to change this i think. On the other side you have the question how many people would use it, since afaict nobody is using Hangfire.postgresql with hangfire.pro batches (i could be very wrong here..).

Sergey also told me that he is considering contributing to postgresql/mysql but he currently doesn't know when so you don't want to give any promises.

mschuurmans avatar Apr 10 '20 07:04 mschuurmans

How does it need to be changed according to Sergey then?

frankhommers avatar Apr 12 '20 16:04 frankhommers

Quoting Sergey

Batches in Hangfire.Pro use all the methods available in JobStorageTransaction and JobStorageConnection classes, so all the available methods should be implemented. But there are a lot of non-functional requirements, which I attempted to document here – https://github.com/HangfireIO/Hangfire/issues/1296 2. So things are much more complicated than I thought initially. And even worse, there’s no documented wisdom on the topic of storage abstractions other than serializability, so this will require a lot of time unfortunately.

As far as I know, the worst moment in the Hangfire.Postgresql implementation is distributed locking. Unlike Hangfire.SqlServer that uses application locks (some kind of advisory locks in PostgreSQL), which are completely in-memory data structures and bound to connections, in the current PostgreSQL implementation locks are based on a dedicated table and fully materializable, which make them hard to be released on a process failure and prone to deadlocks.

Second mail

These years I’ve studied hard on this topic in order to get everything as much simple as possible. I know the answer now, but it will take a lot of time to implement everything without revolution, i.e. in a non-backward-compatible way. I’m even considering to contribute to Hangfire.PostgreSQL and Hangfire.MySQL, but currently don’t know when, so don’t want to give any promises.

I didn’t do this before, because there are so many nuances between different storages, between their transaction isolation semantics (despite they are all ACID-compliant), especially when applied to storage abstractions, that my mind was completely blown up.

mschuurmans avatar Apr 12 '20 17:04 mschuurmans

Moving to locks like these: https://www.postgresql.org/docs/current/explicit-locking.html#ADVISORY-LOCKS wouldn't be that hard I guess. AFAIK that will require PostgreSQL 9.0 and upwards..

But I don't know if that's a good move.

frankhommers avatar Apr 12 '20 17:04 frankhommers

I'm don't know either, i don't know which versions people are running this on.

But besides that there were some earlier comments why advisory locks shouldn't be used here https://github.com/frankhommers/Hangfire.PostgreSql/issues/82

This is from a few years back i can't really tell if this is still relevant.

mschuurmans avatar Apr 12 '20 18:04 mschuurmans

Yeah. The current locks are cross server. Advisory locks are not. But it could be an option if you're running just one hangfire server.

frankhommers avatar Apr 13 '20 14:04 frankhommers

Hi, any news on this?

mattgenious avatar Mar 10 '21 10:03 mattgenious

Is this still the case with 1.9.3

frankhommers avatar Dec 03 '21 16:12 frankhommers

Yes and no, I will explaim Yes: the exception is still in there No: it's work

But becuase the exception we cannot be sure that the code is working so bottom line I think its not working

The exception is:

(Hangfire.PostgreSql.PostgreSqlDistributedLock) hangfire:batch:086beed2-3a13-4eff-8693-fd117f43de34:lock: Failed to lock with transaction Npgsql.PostgresException 40001: could not serialize access due to concurrent update at Npgsql.Internal.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|213_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage) at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken) at Npgsql.NpgsqlDataReader.NextResult() at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken) at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken) at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken) at Npgsql.NpgsqlCommand.ExecuteNonQuery() at Dapper.SqlMapper.ExecuteCommand(IDbConnection cnn, CommandDefinition& command, Action2 paramReader) in /_/Dapper/SqlMapper.cs:line 2858 at Dapper.SqlMapper.ExecuteImpl(IDbConnection cnn, CommandDefinition& command) in /_/Dapper/SqlMapper.cs:line 581 at Hangfire.PostgreSql.PostgreSqlDistributedLock.TransactionLockHandler.Lock(String resource, TimeSpan timeout, IDbConnection connection, PostgreSqlStorageOptions options) 2022-01-26 10:14:09 [WARN] (Hangfire.PostgreSql.PostgreSqlDistributedLock) hangfire:batch:086beed2-3a13-4eff-8693-fd117f43de34:lock: Failed to lock with transaction Npgsql.PostgresException 40001: could not serialize access due to concurrent update at Npgsql.Internal.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|213_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage) at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken) at Npgsql.NpgsqlDataReader.NextResult() at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken) at Npgsql.NpgsqlCommand.ExecuteReader(CommandBehavior behavior, Boolean async, CancellationToken cancellationToken) at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken) at Npgsql.NpgsqlCommand.ExecuteNonQuery() at Dapper.SqlMapper.ExecuteCommand(IDbConnection cnn, CommandDefinition& command, Action2 paramReader) in //Dapper/SqlMapper.cs:line 2858 at Dapper.SqlMapper.ExecuteImpl(IDbConnection cnn, CommandDefinition& command) in //Dapper/SqlMapper.cs:line 581 at Hangfire.PostgreSql.PostgreSqlDistributedLock.TransactionLockHandler.Lock(String resource, TimeSpan timeout, IDbConnection connection, PostgreSqlStorageOptions options)

LirazHaim86 avatar Jan 25 '22 22:01 LirazHaim86