Hangfire
Implement an alternative to DisableConcurrentExecution attribute
The implementation is crappy for long-running jobs, because it requires a long-running distributed lock. And it's totally unclear what timeout to use, because it should cover the maximum possible duration of a background job. The following problems arise when using high timeout values (consider 4 hours, for example):
- Any worker that fetches a job protected by this attribute will wait, and in our example it will wait for up to 4 hours. With multiple such attributes, the whole processing could be stopped.
- For very high values, like 24 hours, resource governors may simply release the distributed lock, and we'll get 2 or more concurrent executions depending on the job duration.
If we use low values, we could end up with a permanent DistributedLockTimeoutException or TimeoutException, depending on the version. There's the MutexAttribute that was implemented as a gist; it uses short distributed locks, but its implementation could be greatly simplified by using a single hash value instead of sets.
So if someone wants to deep dive into Hangfire internals and extensibility points, this is a good entry point!
Ah, and DisableConcurrentExecution should also be marked with the ObsoleteAttribute.
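For reference, here is a minimal sketch of how the attribute is applied today, just to make the timeout dilemma above concrete; the job class and method name are hypothetical:

using Hangfire;

public class ReportGenerator
{
    // The timeout has to cover the longest possible run of this job: too low and
    // waiting workers fail with a lock timeout exception, too high and a worker
    // that picked up a second instance sits blocked for the whole period.
    [DisableConcurrentExecution(4 * 60 * 60)] // 4 hours, in seconds
    public void GenerateDailyReport()
    {
        // long-running work...
    }
}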
Strange. This mutex attribute doesn't block any concurrent jobs in my simple .NET Core 2.0 test case.
As shown below, the following two jobs should be blocked by the first running job and placed in the scheduled queue instead; however, that didn't happen.
Here is the relevant code:
var recurringJobManager = new RecurringJobManager();
recurringJobManager.AddOrUpdate(
    "LongRunningTest",
    Job.FromExpression<HousekeepingService>(j => j.LongRunningTest()),
    jobCronSetting.HousekeepingCronExpression);

[Mutex("LongRunningTest")]
public void LongRunningTest()
{
    _logger.LogDebug("Thread {threadId}: test starts.", Thread.CurrentThread.ManagedThreadId);
    Thread.Sleep(15000);
    _logger.LogDebug("Thread {threadId}: test ends.", Thread.CurrentThread.ManagedThreadId);
}
Any thoughts?
@alphaeleven, what storage are you using and what version of Hangfire.* packages do you have? Also, are you using interfaces for your job (e.g. IHousekeepingService)?
@odinserj Thanks for your quick reply.
<PackageReference Include="HangFire" Version="1.6.17" />
<PackageReference Include="Hangfire.MemoryStorage" Version="1.5.2" />
<PackageReference Include="Hangfire.Pro" Version="2.1.0" />
<PackageReference Include="Hangfire.Pro.Redis" Version="2.2.1" />
I used MemoryStorage for this test case. Does it matter?
Yes, there was an IHousekeepingService and I tried putting the mutex on the interface, but that didn't work either. So, to keep the debugging process easier, the interface is not in use any more.
Upgraded Hangfire to 1.6.19, but issue still persists.
Have you tried to reproduce it using the Hangfire.Pro.Redis or Hangfire.SqlServer packages instead? The locking feature heavily depends on the implementation of locks inside a storage package. If it's not implemented properly, the feature will not work.
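To illustrate that dependency, here is a minimal sketch of the pattern such filters rely on (the filter class and resource name are hypothetical). Everything hinges on the storage package implementing AcquireDistributedLock properly:

using System;
using Hangfire.Server;

// Hypothetical server filter, shown only to illustrate the dependency on storage-level locks.
public class SingleExecutionFilter : IServerFilter
{
    public void OnPerforming(PerformingContext context)
    {
        // The lock is provided by the storage package (SQL Server application locks,
        // Redis locks, and so on). If a storage implements this loosely or as a no-op,
        // any concurrency filter built on top of it silently stops protecting anything.
        var distributedLock = context.Connection.AcquireDistributedLock(
            "my-resource", TimeSpan.FromSeconds(30));

        // Keep the handle so it can be released when the job finishes.
        context.Items["DistributedLock"] = distributedLock;
    }

    public void OnPerformed(PerformedContext context)
    {
        if (context.Items.TryGetValue("DistributedLock", out var handle))
        {
            ((IDisposable)handle).Dispose();
        }
    }
}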
@odinserj as you suggested, it works with Hangfire.Pro.Redis and Hangfire.SqlServer but not with Hangfire.MemoryStorage. Really appreciate your help on sorting this out.
Oops, actually my problem was that I was applying the attribute to the concrete method, but calling the method via the interface in the BackgroundJob.Enqueue<IMyInterface>(...) call. I had to apply the attribute to the method on the interface and then it started working.
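To make that concrete, here is a small sketch of the fix (the interface, method, and mutex name are made up): when a job is enqueued via an interface, the filter attribute has to be placed on the interface method, because that is the method Hangfire stores and inspects.

using Hangfire;

public interface IMyInterface
{
    // The attribute lives on the interface method, not on the implementation,
    // because BackgroundJob.Enqueue<IMyInterface>(...) resolves filters from here.
    [Mutex("my-job")]
    void DoWork();
}

public class MyService : IMyInterface
{
    public void DoWork()
    {
        // actual work...
    }
}

public static class Scheduling
{
    public static void EnqueueWork()
    {
        // Enqueued via the interface, so the Mutex filter is picked up.
        BackgroundJob.Enqueue<IMyInterface>(x => x.DoWork());
    }
}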
Hi @odinserj I am dealing with the exact behavior that you mentioned in your initial post.
I have a long-running job that can take anywhere from 30 minutes to 24 hours or more, depending on the data to load. I only want one instance of this job to run at a given time. I am using DisableConcurrentExecution with a 24-hour timeout, but for some reason, when I run the job with no other instances running, it gets stuck waiting on the distributed lock (even though there should be none, since no other instances of the job are currently running).
I know this is a WIP but does anyone know of a workaround?
Currently using Hangfire Core 1.7.3 with Hangfire.PostgreSQL 1.6.0
Cheers
Truncating the lock table worked for now. Not ideal but works in a pinch.
TRUNCATE TABLE hangfire.lock;
I created a new MaxConcurrentExecutions filter that handles this correctly: https://gist.github.com/henrik-donuts/0e38af74f95ebfce103fd2568d3455f7
Currently using Hangfire Core 1.7.3 with Hangfire.PostgreSQL 1.6.0
Any chance of merging this new MaxConcurrentExecutions filter into core Hangfire, @odinserj?
Any updates on this? I'm also facing application issues trying to use DisableConcurrentExecution to keep a single instance of a job running regularly. I locked up the entire Hangfire instance on my production server when the distributed lock issue hit: no jobs could be queued and things were bad. The link above is long broken too.
After reading so many good things about Hangfire, I found out that something as basic as this doesn't work out of the box.
Any update on when this will be fixed/available in core? The MaxConcurrentExecutions gist link is broken. I'm not sure if MutexAttribute is supported.
What about this one?
https://github.com/alastairtree/Hangfire.MaximumConcurrentExecutions
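Assuming that package works the way its name suggests, usage would presumably look something like the sketch below; treat the attribute signature as an assumption and check the project's README before relying on it:

using Hangfire;

public class MigrationJobs
{
    // Assumed usage of the Hangfire.MaximumConcurrentExecutions attribute:
    // at most one execution of this job at a time across all servers.
    // The job class and method name here are hypothetical.
    [MaximumConcurrentExecutions(1)]
    public void RunDatabaseMigration()
    {
        // migration work...
    }
}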
We've been struggling with this for a while and I'm actually amazed that it hasn't been solved already. We run from 6 to 8 ECS instances and we need to run our DB migration only once in our shared DB.
The fact that there's no out-of-the-box way to say "only run this job on one of the instances, and only retry on that instance if it fails" is really sad...
The MaximumConcurrentExecutions package would have been perfect, but all the links are broken... Any ideas, guys?
The link https://github.com/alastairtree/Hangfire.MaximumConcurrentExecutions still works for me