
A way to pause/stop jobs from being processed

Open nat1craft opened this issue 10 years ago • 45 comments

Routinely we get into the situation where we want a specific server to temporarily stop processing jobs. We don't want to stop jobs from queuing up or being processed by other servers; we just want to take a specific server out of the equation.

Example: Sometimes I have a production server and a development server pointing to the same Hangfire database so we can troubleshoot issues. (I know, I know...) I would like to pause the production server (or maybe pause the dev server) from processing jobs and let the other server handle them. And I don't want to restart the servers in order to do that.

Another Example: Maybe we plan to take a server down for maintenance so we pause it from processing new jobs, wait for it to finish current jobs, and then take it down.

Is there a way to programmatically stop the site from processing jobs?

nat1craft avatar Oct 08 '14 18:10 nat1craft

You can hold the server instance (for example, in a static field) and call its Stop/Start methods. But currently this is possible only through application logic (no built-in support).

odinserj avatar Nov 17 '14 11:11 odinserj

Good to know. It would be nice to add a button for this in the admin panel.

guillaumeroy1 avatar Feb 17 '15 14:02 guillaumeroy1

Why are these methods deprecated? We need the exact same functionality. Can anyone confirm if they still work? I read in another post that the functionality was disabled.

BKlippel avatar Nov 13 '15 21:11 BKlippel

Yes, those methods were deprecated. Simply dispose and recreate an instance of the BackgroundJobServer class to use this functionality.
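
A minimal sketch of the dispose-and-recreate approach (illustrative only, not from this comment; the PausableServer wrapper name is made up, and it assumes Hangfire storage is already configured):

    using System;
    using Hangfire;

    // Hypothetical wrapper that "pauses" processing on this node by disposing
    // the server and "resumes" it by creating a fresh BackgroundJobServer.
    public class PausableServer
    {
        private readonly object _sync = new object();
        private BackgroundJobServer _server;

        public void Resume()
        {
            lock (_sync)
            {
                // Re-creating the server makes this node start fetching jobs again.
                if (_server == null) _server = new BackgroundJobServer();
            }
        }

        public void Pause()
        {
            lock (_sync)
            {
                // Dispose stops fetching, signals cancellation, and waits up to
                // ShutdownTimeout for running jobs before aborting them.
                _server?.Dispose();
                _server = null;
            }
        }
    }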

odinserj avatar Nov 14 '15 11:11 odinserj

But disposing the BackgroundJobServer triggers cancellation and aborts all currently running jobs. We need new jobs to not start and all jobs already executing to complete. That is the only graceful way to prepare a system with long-running jobs for shutdown and a code update. It's critical for production systems to have a methodology for draining queues.

BKlippel avatar Nov 16 '15 20:11 BKlippel

I still think the on/off button in the admin panel is a must-have. When we deploy a new version of the code, I would like that setting to be remembered, and I would like to restart the jobs when the code is ready to go live.

@odinserj do you think it could be implemented?

guillaumeroy1 avatar Nov 17 '15 16:11 guillaumeroy1

Anything? We would gladly contribute source modifications if they would be accepted, otherwise we're close to just creating our own fork.

Thanks, Brian

BKlippel avatar Dec 10 '15 01:12 BKlippel

I totally agree with @BKlippel

gandarez avatar Dec 11 '15 12:12 gandarez

What's the status of this issue? I had a problem today in production and needed to manually stop the server!!!

gandarez avatar Jan 27 '16 19:01 gandarez

I agree, a Pause/Resume button on the dashboard would help me out greatly. Can you please include this feature? Thanks.

PJTewkesbury avatar Apr 04 '16 15:04 PJTewkesbury

I would like to see this as well.

JefRH1 avatar Jun 16 '16 15:06 JefRH1

I agree with the other posters. If I need to restart the server for some reason there's no graceful way to do it if jobs are in progress.

A way to pause the worker processes, or temporarily set the server to have a count of 0 worker processes would do the trick if it let the existing processing complete.

Has anyone implemented anything to accomplish this on their own?

CREFaulk avatar Jul 06 '16 15:07 CREFaulk

+1

alastairtree avatar Jul 08 '16 14:07 alastairtree

+1

wobbince avatar Jul 08 '16 20:07 wobbince

Here's a crude workaround: add a wait queue as the first priority and fill it with jobs which do nothing but run until a condition is met. Periodically check for a stored hash value to change and keep running the filler jobs until it does, one filler job per worker. Add an API to pause and unpause processing. If the pause state is in the stored hash, then generate wait jobs when the service starts; but perhaps they'd still be re-queued automatically if they were running at shutdown anyway.

I know, this really is crude. :-) I don't see the queues represented as more than an attribute or filter in any of the existing classes, and the DB schema only defines queues as lists of jobs, with nothing for overall queue attributes.

Alternatively perhaps the worker class could be extended so that workers can be put into a paused state or reduced to 0? There are sleep methods under BackgroundProcessContext but no documentation on what they do.

Edit: The IMonitoringAPI can be used to retrieve server details but changing the WorkersCount does nothing. It was worth a shot. :-)

Final edit: Creating a sleeper class and filling the workers with it did what I needed. I store a pause status in a list and put a flag in cache. The sleepers use Thread.Sleep until the cache flag is gone and then quit. When I restart, I re-create the cache flag if the stored flag exists, and that does the job. At any rate, it keeps the workers away from the database and from active processing when I want to shut down.
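
A minimal sketch of that sleeper idea (an illustration only, not the actual code from this comment; the PauseFlags helper and the "pause" queue name are made up):

    using System;
    using System.Threading;
    using Hangfire;

    // Hypothetical flag store; in the comment above this was a cache entry.
    public static class PauseFlags
    {
        public static volatile bool Paused;
    }

    public static class SleeperJob
    {
        // Enqueue one of these per worker on a top-priority queue so every
        // worker is busy doing nothing while the pause flag is set.
        [Queue("pause")]
        public static void Sleep()
        {
            while (PauseFlags.Paused)
            {
                Thread.Sleep(TimeSpan.FromSeconds(5));
            }
            // Flag cleared: the sleeper exits and the worker is free again.
        }
    }

    // Pausing then means enqueuing one sleeper per worker, e.g.:
    // for (var i = 0; i < workerCount; i++) BackgroundJob.Enqueue(() => SleeperJob.Sleep());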

CREFaulk avatar Jul 09 '16 14:07 CREFaulk

+1. Without being able to let all your workers finish up and then stop taking new jobs, it's hard to do a graceful deployment.

markalanevans avatar Jul 28 '16 23:07 markalanevans

+1

gandarez avatar Sep 20 '16 12:09 gandarez

+1

tomasfalt avatar Dec 12 '16 11:12 tomasfalt

+1

dlazzarino avatar Mar 28 '17 09:03 dlazzarino

+1 - this would be very useful in our environment.

danielcor avatar Mar 28 '17 12:03 danielcor

@odinserj This thread seems to suggest there is no consideration in Hangfire for how to handle jobs in progress during a deployment. I find that hard to believe given the age and maturity of this platform. What is the official best practice for production deployments and ensuring data consistency?

cottsak avatar Mar 31 '17 05:03 cottsak

But disposing the BackgroundJobServer triggers cancellation and aborts all currently running jobs. We need new jobs to not start and all jobs already executing to complete. That is the only graceful way to prepare a system with long-running jobs for shutdown and a code update. It's critical for production systems to have a methodology for draining queues.

@BKlippel I'm not sure I agree. If Hangfire guarantees to run a job "exception free", then any state that would cause a problem if it became inconsistent should be designed with ACID principles in mind, right?

So the way I see it: if my jobs are designed with ACID/transaction protection in mind (for those that need it), then even if their threads are killed mid-processing, Hangfire will re-queue and execute those jobs again after the deployment. In that case the failed/incomplete invocation won't leave an inconsistency, because the job has been designed not to. Is there still a problem here?
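
A minimal sketch of what such a job might look like (purely illustrative; the Orders table, Settled column, and connection string are assumptions): all state changes sit inside one transaction, so a killed thread leaves nothing half-written and a re-run starts from a clean, detectable state.

    using System.Data.SqlClient;
    using System.Transactions;

    public static class SettleOrderJob
    {
        // If Hangfire kills the thread and later re-runs the job, the transaction
        // guarantees no partial state survives, and the guard makes re-runs no-ops.
        public static void Settle(int orderId)
        {
            using (var scope = new TransactionScope())
            using (var connection = new SqlConnection("Server=.;Database=App;Integrated Security=true"))
            {
                connection.Open();

                // Idempotency guard: skip work a previous attempt already completed.
                using (var check = new SqlCommand("SELECT Settled FROM Orders WHERE Id = @id", connection))
                {
                    check.Parameters.AddWithValue("@id", orderId);
                    if (check.ExecuteScalar() as bool? == true) return;
                }

                using (var update = new SqlCommand("UPDATE Orders SET Settled = 1 WHERE Id = @id", connection))
                {
                    update.Parameters.AddWithValue("@id", orderId);
                    update.ExecuteNonQuery();
                }

                scope.Complete(); // nothing becomes visible until this point
            }
        }
    }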

cottsak avatar Mar 31 '17 05:03 cottsak

As for a "Stop/Start" feature or workaround, one might look at the design choice and @odinserj's suggestion to dispose and re-create the BackgroundJobServer as "making this harder" but I don't think so. I think the design reveals that you simply don't need to.

Pausing one job or a specific subset is one thing: use filters or custom logic in your job entry point. But a top-level/BackgroundJobServer "Stop/Start" feature makes no sense if it's meant to solve the deployment concern; you simply don't need it. Every web node will shut down its BackgroundJobServer when recycled, and Hangfire will guarantee the re-queuing and invocation of jobs that didn't complete or that failed. All sensitive state will be designed with ACID principles, and so the work missed mid-deployment will be completed after the deployment.

Let's address the OP's original concrete scenarios:

Example: Sometimes I have a production server and a development server pointing to the same Hangfire database so we can troubleshoot issues. (I know, I know...) I would like to pause the production server (or maybe pause the dev server) from processing jobs and let the other server handle them. And I don't want to restart the servers in order to do that.

Here you essentially want to "take one server out of the load", like you might with a load balancer. I believe filters and queue configuration can achieve this. It's not built in, but you can do it by keeping a specific server from consuming from a specified queue.
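
A minimal sketch of one way to do that (an illustration, not a built-in feature; the "HangfireRole" setting, the queue names, and the "HangfireDb" connection string name are assumptions): give each server an explicit queue list, and a server whose list omits the queue your jobs are routed to simply never picks them up.

    using System.Configuration;
    using Hangfire;
    using Owin;

    public class Startup
    {
        public void Configuration(IAppBuilder app)
        {
            GlobalConfiguration.Configuration.UseSqlServerStorage("HangfireDb");

            // Assumed per-machine setting: "worker" or "standby".
            var role = ConfigurationManager.AppSettings["HangfireRole"];

            var options = new BackgroundJobServerOptions
            {
                // A "standby" server only listens on a queue nothing is routed to,
                // so it stays registered in the dashboard but processes nothing.
                Queues = role == "standby"
                    ? new[] { "idle" }
                    : new[] { "critical", "default" }
            };

            app.UseHangfireServer(options);
        }
    }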

Another Example: Maybe we plan to take a server down for maintenance so we pause it from processing new jobs, wait for it to finish current jobs, and then take it down.

What does "maintenance" mean or why does "maintenance" require the need to stop processing jobs? Because if it's a deployment scenario we've covered that above. Usually a "maintenance mode" flag is designed to allow the production environment to run but prevent state change/input from regular users. Maybe admins want to try something with no other changes happening at the same time. Well in these cases, I'd say largely that there would only be a subset of background job types that would be affected by this. Many background jobs won't mutate anything while there is no user initiated mutation. However some will. For that subset, use filters or other techniques suggested to "pause" just those. Many folks won't have this category of job at all so I can see why the application of pragmatism has prevented a root level "pause" feature of all jobs.

cottsak avatar Mar 31 '17 05:03 cottsak

We also see the need for a "Stop accepting new jobs" option.

We run Hangfire as a standalone server in a cluster of currently 2 servers, not dependent on IIS or ASP.NET, so recycles are of no concern for us, and many processes we plan to move to Hangfire are today running as standalone services.

Most are not designed to handle thread aborts, and redesigning them is in some cases impractical due to the nature of the job; graceful shutdown requires the jobs to finish writing processed data (we are calling an external API and need to log the results reliably).

Hangfire kills the threads 12-18 seconds after Dispose is called, which can be too little time, as we have to wait for the external API's return value and then commit it to the database.

Since our goal is to be able to shut down the servers on a rolling schedule and never disable the service completely, we cannot use any scheduling magic: we cannot prevent the local server from picking up new jobs without preventing all other servers from picking them up.

We were thinking of using a hack based on disposing the server without closing the process, which allows running jobs to continue, but since this also immediately re-queues the jobs, another server might pick one up, causing unwanted concurrent execution.

Currently our only option seems to be to reduce the amount of state we keep, to minimize data loss in case of shutdown, but we cannot prevent it reliably, which is a shame.
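
The 12-18 second window mentioned above corresponds to the server's shutdown timeout, which is configurable. A minimal sketch (illustrative only; the five-minute value and the "HangfireDb" connection string name are arbitrary assumptions) using BackgroundJobServerOptions.ShutdownTimeout:

    using System;
    using Hangfire;

    class Program
    {
        static void Main()
        {
            GlobalConfiguration.Configuration.UseSqlServerStorage("HangfireDb");

            var options = new BackgroundJobServerOptions
            {
                // Give running jobs more time to finish before their threads are
                // aborted when the server is disposed or the process shuts down.
                ShutdownTimeout = TimeSpan.FromMinutes(5)
            };

            using (new BackgroundJobServer(options))
            {
                Console.WriteLine("Hangfire server started. Press ENTER to stop.");
                Console.ReadLine();
            } // Dispose waits up to ShutdownTimeout for jobs that are still running.
        }
    }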

dmartensson avatar May 08 '17 12:05 dmartensson

OK, it's been a while, and I will happily tell everyone what we have done to mostly work around these issues.

We now deploy into 2 queue "channels", call them "A" and "B" if you like. This isn't perfect, as you still need to account for the spill-over, as it were, and not overwrite an active channel that has not finished draining. However, what it does allow is for a significant code change to divert new work to the alternate channel. Our channels are defined in our connection strings; we have a Hangfire DB per channel. When we deploy (we use Octopus), if we need to update code without interrupting jobs that are currently long-running, we deploy to the next (or simply the alternate) channel. Prior jobs keep running; new jobs take advantage of the new code. Of course our deployment synchronizes the submission channel of the web servers with that of the background services. We configure our services so they are named by channel and coexist on the relevant app servers. The old channels can then drain, or be shut down.

Do with this as you may, but keep in mind there's not really a pure "exception free" queue. Queues need to be cancelable and drainable while also being durable across code updates. If we can't pause a queue without interrupting jobs in motion, then we need an alternative. Sergey has created something truly awesome here; you're just missing some of the fine print. You have built for "development", but this is ultimately a "DevOps" tool. This is competing with the likes of RabbitMQ; that's a big deal. The suggestion for pause is not that jobs would be interrupted, just that the queue would stop being queried and drain. Matt, I appreciate what you are trying to explain, but you aren't considering the consequences of completely restarting jobs when the queue restarts. In very simple cases it's probably not a big deal. But a product as flexible as this introduces many cases where updated code will no longer be compatible with the data of a previously submitted job. In that case ACID fails anyway. You would want the ability to segregate that queue. That's a tall order, so what people have been asking for is more simply the option to pause and drain a queue, holding new submissions (maybe even versioning them) for a new queue processor to be introduced soon after. I don't know what to say if you don't see this outcome; it's fairly common. My rough workaround solves this (use as many channels as you need), but it would be a nice feature for the queue to support intrinsically.
I'll just conclude with: good job, Sergey, we still love it.
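
A minimal sketch of the channel idea (purely illustrative; the "DeployChannel" setting and the HangfireChannelA/HangfireChannelB connection string names are assumptions): each deployment reads its channel from config and points its Hangfire storage, and therefore its queues, at that channel's database.

    using System;
    using System.Configuration;
    using Hangfire;

    class Program
    {
        static void Main()
        {
            // Assumed config: DeployChannel = "A" or "B", with connection strings
            // HangfireChannelA / HangfireChannelB pointing at separate Hangfire databases.
            var channel = ConfigurationManager.AppSettings["DeployChannel"];
            var connectionString = ConfigurationManager
                .ConnectionStrings["HangfireChannel" + channel].ConnectionString;

            GlobalConfiguration.Configuration.UseSqlServerStorage(connectionString);

            // Web servers enqueue to, and app servers process from, the same channel's
            // database, so an old channel simply drains after a deployment switches over.
            using (new BackgroundJobServer())
            {
                Console.ReadLine();
            }
        }
    }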

s123klippel avatar May 09 '17 04:05 s123klippel

@s123klippel @dmartensson that's pretty much what we did as well. We centralized a lot of console-app, scheduled-task-style batch jobs into Hangfire, but many of them are long-running (30-40 minutes of data munging). We could rewrite them to support checkpoints, break them into smaller tasks... or do graceful restarts of Hangfire. Graceful restarts seemed much easier ;)

On app startup after a deployment we read the deployed 'environment', which can really be any string; we only deploy once or twice a day, so it just alternates between blue and green. If the current deployment is blue, the application instance listens for jobs on the blue queue.

The second step is to only enqueue jobs to the 'active' queue, which is stored in the database and in cache, and toggled after a deployment. Each deployment has 2 instances, for a total of potentially 4 nodes connected to the Hangfire database. With Hangfire, every node connected to the database will pick up and try to enqueue 'scheduled' jobs, so there has to be a central source of truth for which queue the scheduled jobs go to.

The environment switching, plus deploying to a new server, lets us gracefully drain the old deployment before shutting it down, and works really well. The only issue we run into is this one -- https://discuss.hangfire.io/t/failed-can-not-change-the-state-of-a-job-to-enqueued-target-method-was-not-found/122 -- because of the mechanism for enqueuing scheduled jobs, an old node will try to enqueue a brand-new job that doesn't exist in its codebase, and it will fail once or twice before the new deployment picks it up.

    using System;
    using System.Linq;
    using Hangfire;
    using Hangfire.States;

    // Startup fragment. envSwitcher is an IEnvironmentSwitcher resolved from your container.
    Hangfire.GlobalConfiguration.Configuration.UseFilter(new HfEnvServerFilter(envSwitcher));
    var bjso = new BackgroundJobServerOptions
    {
        // Listen on a machine-specific queue plus the queue of the environment this code was deployed to.
        Queues = new[] { Environment.MachineName.ToLower(), WebConfiguration.AppSettings["DeployedEnvironment"] }
    };
    app.UseHangfireServer(bjso);
    addAndUpdateScheduledJobs();

    public class HfEnvServerFilter : IElectStateFilter
    {
        // IEnvironmentSwitcher can pull the 'active' environment from cache and db; it is toggled after a new deploy.
        private readonly IEnvironmentSwitcher _envSwitcher;

        public HfEnvServerFilter(IEnvironmentSwitcher envSwitcher)
        {
            _envSwitcher = envSwitcher;
        }

        public void OnStateElection(ElectStateContext context)
        {
            // Flaky cache? Fall back to the currently deployed environment in those situations.
            var activeEnvironment = (_envSwitcher.TryGetActiveEnvironment() ?? _envSwitcher.TheCurrentEnvironment).ToString().ToLower();

            var enqueuedState = context.CandidateState as EnqueuedState;
            if (enqueuedState == null)
                return;

            // Support our custom queue name attribute.
            var queueNameAttributes = context.BackgroundJob.Job.Method.DeclaringType.CustomAttributes
                .Union(context.BackgroundJob.Job.Method.CustomAttributes)
                .Where(attr => attr.AttributeType == typeof(Utility.Hangfire.RunOnQueueAttribute))
                .SelectMany(attr => attr.NamedArguments)
                .Where(arg => arg.MemberName == "QueueName");

            if (queueNameAttributes.Any())
            {
                enqueuedState.Queue = queueNameAttributes.Last().TypedValue.Value.ToString();
            }
            else
            {
                enqueuedState.Queue = activeEnvironment;
            }
        }
    }

blyry avatar May 09 '17 13:05 blyry

@blyry That does not solve how to trigger cancellation gracefully (without killing threads). And also, jobs that are re-queued by Hangfire, according to this bug https://github.com/HangfireIO/Hangfire/pull/502, always get queued on the default queue and so would not be picked up at all then?

dmartensson avatar May 09 '17 15:05 dmartensson

The filter I posted takes care of that, and it seems similar to the workaround proposed in #502. And you're right, we don't solve the graceful cancellation problem, but our dual deployment made it unnecessary to gracefully cancel anything. Old deployments run until they are empty and then they are killed.

Sometimes deployments still get recycled / iisreset; for sure this doesn't remove the need for a better graceful shutdown mechanism, but it's been an acceptable workaround for us. Basic graceful shutdown support would have 2 flavors, right? 1) shut down when finished processing, or 2) shut down as soon as possible. So we technically support 1, but not 2.

blyry avatar May 09 '17 20:05 blyry

I think I found a solution to graceful shutdown.

There appears to exist a new replacement for Stop/Start in SendStop on the server instance object.

I have tested it and it sets the cancellation token but does not force running threads to abort.

It also makes the local server instance stop picking up new jobs.

So doing this and then waiting until the server instance has no running jobs should make a graceful shut-down possible.

Are my assumptions about SendStop correct, or am I missing something?

SendStop is not marked as deprecated.
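
A minimal sketch of that sequence (an illustration of the idea only; note that ProcessingCount() counts jobs across all servers, so a strictly per-server check would need the monitoring API's server details instead):

    using System;
    using System.Threading;
    using Hangfire;

    public static class GracefulShutdown
    {
        // Assumed: `server` is the BackgroundJobServer instance this process created.
        public static void StopAndDrain(BackgroundJobServer server)
        {
            // Stop fetching new jobs and signal cancellation, without aborting threads.
            server.SendStop();

            var monitoring = JobStorage.Current.GetMonitoringApi();

            // Wait until nothing is processing any more (globally, in this crude sketch),
            // with an upper bound so we never hang forever.
            var deadline = DateTime.UtcNow.AddMinutes(30);
            while (monitoring.ProcessingCount() > 0 && DateTime.UtcNow < deadline)
            {
                Thread.Sleep(TimeSpan.FromSeconds(5));
            }

            // Now disposing is safe: there should be nothing left to abort.
            server.Dispose();
        }
    }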

dmartensson avatar May 15 '17 16:05 dmartensson

Nice, good find! Added in 1.6.0, so available from Jul 15, 2016. Googling for that led me to this thread -- https://discuss.hangfire.io/t/ability-to-stop-all-the-server-instances-during-deployments/2285 -- where @odinserj explained it and sort of what we're trying to do here.

Is it just as easy to start the server back up? Do you have to call Dispose after SendStop, or can you call Start at a later date on the same server instance and have everything work fine?

blyry avatar May 15 '17 19:05 blyry