Hangfire icon indicating copy to clipboard operation
Hangfire copied to clipboard

Implement an alternative to DisableConcurrentExecution attribute

Open odinserj opened this issue 7 years ago • 18 comments

The implementation is crappy for long-running jobs, because it requires a long-running distributed lock. And it's totally unclear what timeout to use, because it should cover the maximum possible duration of a background job. The following problems arise, when using high values for timeouts (consider 4 hours, for example):

  1. Any worker that fetched a job protected by this attribute will wait, and will wait for 4 hours in our example. Have multiple attributes – the whole processing could be stopped.
  2. For very high values, like 24 hours, resource governors may simply release the distributed lock, and we'll get 2 or more concurrent executions depending on the job duration.

If we are using low values, we could end up with permanent DistributedLockTimeoutException or TimeoutException depending on the version. There's the MutexAttribute that was implemented as a gist, and it is using short distributed locks, but its implementation could be greatly simplified by using a single hash value instead of sets.

So if someone wants to deep dive into Hangfire internals and extensibility points, this is a good entry point!

odinserj avatar Nov 09 '17 23:11 odinserj

Ah, and DisableConcurrentExecution should also be marked with the ObsoleteAttribute.

odinserj avatar Nov 09 '17 23:11 odinserj

Strange. This mutex attribute doesn't block any concurrent jobs in my simple .net core 2.0 test case.

capture1

as shown the following 2 jobs should be blocked by the first running job and placed in the scheduled queue intead. However, it didn't happen.

here is the relevant codes:

var recurringJobManager = new RecurringJobManager();
            recurringJobManager.AddOrUpdate("LongRunningTest", Job.FromExpression<HousekeepingService>(j => j.LongRunningTest()), jobCronSetting.HousekeepingCronExpression);

[Mutex("LongRunningTest")]
 public void LongRunningTest()
{
      _logger.LogDebug("Thread {threadId}: test starts.", Thread.CurrentThread.ManagedThreadId);
      Thread.Sleep(15000);
      _logger.LogDebug("Thread {threadId}: test ends.", Thread.CurrentThread.ManagedThreadId);
 }

Any thoughts?

zhenmin-peng avatar Jun 25 '18 15:06 zhenmin-peng

@alphaeleven, what storage are you using and what version of Hangfire.* packages do you have? Also, are you using interfaces for your job (e.g. IHousekeepingService)?

odinserj avatar Jun 25 '18 16:06 odinserj

@odinserj Thanks for your quick reply.

<PackageReference Include="HangFire" Version="1.6.17" />
<PackageReference Include="Hangfire.MemoryStorage" Version="1.5.2" />
<PackageReference Include="Hangfire.Pro" Version="2.1.0" />
<PackageReference Include="Hangfire.Pro.Redis" Version="2.2.1" />

I used MemoryStorage for this test case. Does it matter?

Yes. there was an IHousekeepingService and I tried putting mutex on the interface, but didn't work either. So trying to keep debug process eaier, the interface is not in use any more.

zhenmin-peng avatar Jun 25 '18 18:06 zhenmin-peng

Upgraded Hangfire to 1.6.19, but issue still persists.

zhenmin-peng avatar Jun 26 '18 15:06 zhenmin-peng

Have you tried to reproduce it when using Hangfire.Pro.Redis or Hangfire.SqlServer packages instead? The locking feature heavily depends on the implementation of locks inside a storage package. If it's not implemented properly, the feature will not work.

odinserj avatar Jun 26 '18 15:06 odinserj

@odinserj as you suggested it works only on Hangfire.Pro.Redis and Hangfire.SqlServer but not Hangfire.MemoryStorage. Really appreciate your help on sorting this out.

image

image

zhenmin-peng avatar Jul 27 '18 09:07 zhenmin-peng

Oops, actually my problem was that I was applying the attribute to the concrete method, but calling the method via the interface in the BackgroundJob.Enqueue<IMyInterface>(...) call. I had to apply the attribute to the method on the interface and then it started working.

kemmis avatar Aug 06 '18 12:08 kemmis

Hi @odinserj I am dealing with the exact behavior that you mentioned in your initial post.

I have a long-running job that can take anywhere from 30 minutes to 24 hours or more depending on the data to load. I only want one instance of this job to run at a given time. I am using DisableConcurrentExecution with a 24 hour timeout but for some reason when I run the job with no other instances running, it gets stuck waiting on the distributed lock to finish (even though there should be none as there are no other instances of the job currently running).

I know this is a WIP but does anyone know of a workaround?

Currently using Hangfire Core 1.7.3 with Hangfire.PostgreSQL 1.6.0

Cheers

ArcadeRenegade avatar Jun 04 '19 06:06 ArcadeRenegade

Truncating the lock table worked for now. Not ideal but works in a pinch.

TRUNCATE TABLE hangfire.lock;

ArcadeRenegade avatar Jun 04 '19 07:06 ArcadeRenegade

I created a new MaxConcurrentExecutions filter that handles this correctly: https://gist.github.com/henrik-donuts/0e38af74f95ebfce103fd2568d3455f7

Currently using Hangfire Core 1.7.3 with Hangfire.PostgreSQL 1.6.0

henrik-donuts avatar Jun 24 '19 20:06 henrik-donuts

Any chances of merging this new filter MaxConcurrentExecutions to core hangfire @odinserj?

staticdev avatar Jun 27 '19 17:06 staticdev

Any updates on this? I'm also facing application issues trying to use DisableConcurrentExecution to keep a single instance of a job running regularly. I locked up the entire handfire instance on my production server when the distributed lock issue hit, no jobs could be queued and things were bad, The above link is long broken too.

IronSean avatar Nov 14 '19 20:11 IronSean

After reading so many good things about hangfire I found out that something so basic as this is not working out of the box

emumanu avatar Nov 15 '19 00:11 emumanu

Any update on when this will be fixed/available in core? The MaxConcurrentExecutions gist link is broken. I'm not sure if MutexAttribute is supported.

brendonwm avatar Mar 24 '20 05:03 brendonwm

What about this one?

https://github.com/alastairtree/Hangfire.MaximumConcurrentExecutions

agausachs avatar Jun 04 '20 12:06 agausachs

We've been struggling with that for a while and I'm actually amazed that this hasn't been solved already. We run from 6 to 8 ECS instance and we need to run our DB migration only once in our shared DB.

The fact that there's no out-of-the-box way to say only run this job in one of the instances, and only retry in this one instance if it fails is really sad...

The MaximumConcurrentExecutions would have been perfect but all links are broken... Any ideas guys?

yannlairdc avatar Aug 10 '22 23:08 yannlairdc

The link https://github.com/alastairtree/Hangfire.MaximumConcurrentExecutions still works for me

alastairtree avatar Aug 11 '22 06:08 alastairtree