
Issue: Job Enqueued count incorrect/mismatch

[Open] yadurajshakti opened this issue 1 year ago • 5 comments

Hi Team,

We are facing an issue where the enqueued count is incorrect/mismatched, as you can see in the screenshot below. We could not find any solution on the Hangfire forum or GitHub yet. Please help us with an explanation of this behavior and a possible fix.

[screenshot: dashboard showing the incorrect Enqueued count]

We are using .NET 7.0 and PostgreSQL. The Hangfire packages and versions used in the solution are:

  • Hangfire.AspNetCore 1.8.0
  • Hangfire.Core 1.8.0
  • Hangfire.PostgreSql 1.19.12

One observation: the job is moved off the queue, but its state is not changed to either succeeded or failed. [screenshot: job removed from the queue with its state unchanged]

yadurajshakti avatar Oct 15 '24 04:10 yadurajshakti

It would be good to provide at least the Hangfire configuration/setup. Best case would be a minimal reproducible example.

azygis avatar Oct 15 '24 05:10 azygis

Hi @azygis

We have done a setup similar to the one on the official website: https://docs.hangfire.io/en/latest/getting-started/aspnet-core-applications.html

The dashboard is a separate web application, and we have a console app to schedule the job.


Dashboard Configuration:

public void ConfigureServices(IServiceCollection services)
{
    // Add Hangfire services.
    services.AddHangfire(configuration => configuration
        .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
        .UseSimpleAssemblyNameTypeSerializer()
        .UseRecommendedSerializerSettings()
        .UseSqlServerStorage(Configuration.GetConnectionString("HangfireConnection")));  
}

Then in the app configuration:

public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
{
    app.UseHangfireDashboard(
        "/hangfire",
        new DashboardOptions
        {
            Authorization = new[] { new AuthorizationFilter() },
            AppPath = "/hangfire/home/index"
        },
        new PostgreSqlStorage(Configuration.GetConnectionString("HangfireConnection")));
}

Engine configuration responsible for scheduling and executing the jobs:

private void ConfigureHangFire(HostBuilderContext hostContext, IServiceCollection services)
{
    // connectionString is resolved from configuration elsewhere.
    GlobalConfiguration.Configuration.UsePostgreSqlStorage(connectionString, new PostgreSqlStorageOptions
    {
        DistributedLockTimeout = TimeSpan.FromMinutes(5),
        InvisibilityTimeout = TimeSpan.FromMinutes(20)
    });
    GlobalConfiguration.Configuration.UseSimpleAssemblyNameTypeSerializer();
    GlobalConfiguration.Configuration.UseRecommendedSerializerSettings();
    GlobalJobFilters.Filters.Add(new AutomaticRetryAttribute { Attempts = 3 });

    // Presumably passed to the BackgroundJobServer elsewhere.
    var options = new BackgroundJobServerOptions
    {
        WorkerCount = 1
    };
}

yadurajshakti avatar Oct 21 '24 05:10 yadurajshakti

Your dashboard configuration is using the SQL Server integration. Are you sure you copied your own configuration and not the one from the Hangfire docs?

azygis avatar Oct 22 '24 05:10 azygis

Hi @azygis, here is the dashboard configuration from my application. We are only using the UsePostgreSqlStorage(..) method.

[screenshot: dashboard configuration using UsePostgreSqlStorage]
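
For readers following along: the configuration in the screenshot is presumably similar to the sketch below (based on the Hangfire.PostgreSql 1.19.x API, not the exact code from the screenshot).

services.AddHangfire(configuration => configuration
    .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
    .UseSimpleAssemblyNameTypeSerializer()
    .UseRecommendedSerializerSettings()
    .UsePostgreSqlStorage(Configuration.GetConnectionString("HangfireConnection")));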

yadurajshakti avatar Oct 23 '24 03:10 yadurajshakti

Can you check whether the lock table has entries? I suspect a lock was placed there and the application exited in a "kill" fashion, which prevented the locks from being cleared. Nothing picking up a job for 17 days is usually related to zombie locks.
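
(As an illustration, not from the thread: the leftover locks can be inspected directly, assuming the default hangfire schema used by Hangfire.PostgreSql. The connection string is a placeholder; Npgsql is a dependency of Hangfire.PostgreSql, so it is already available.)

using System;
using Npgsql;

// Placeholder connection string; point it at the Hangfire database.
var connectionString = "Host=localhost;Database=hangfire;Username=app;Password=secret";

await using var connection = new NpgsqlConnection(connectionString);
await connection.OpenAsync();

// "lock" is quoted because it is also a SQL keyword.
await using var command = new NpgsqlCommand(
    "SELECT resource, acquired FROM hangfire.\"lock\"", connection);
await using var reader = await command.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
    // Rows that never disappear and carry an old "acquired" timestamp
    // are likely zombie locks left behind by a killed process.
    var acquired = reader.IsDBNull(1) ? "unknown" : reader.GetDateTime(1).ToString("u");
    Console.WriteLine($"{reader.GetString(0)} acquired at {acquired}");
}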

azygis avatar Oct 23 '24 05:10 azygis

Thanks @azygis. This is happening mostly in our UAT/QA environments, where the deployment frequency is higher. Can you please suggest a mechanism to handle such cases?

  • Should we manually stop Hangfire jobs/engine/services during deployment?
  • Do you have any implementation to reset locks and other tables during deployments?

yadurajshakti avatar Nov 04 '24 02:11 yadurajshakti

First of all, do not kill the process. Instead of SIGKILL, send SIGTERM when you are stopping the application. I do not know what you use for stopping it, hence that's the first suggestion. A proper shutdown sends a cancellation request to the processing jobs, as long as you use cancellation tokens. If the cancellation doesn't complete within a few seconds (sorry, can't remember the exact value; it's like 5 or 10s or some other number), the jobs get terminated. Locks are released right before the application stops. If the application is straight up killed, there's no way for Hangfire to exit cleanly.
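
(For illustration, a minimal sketch of a cancellation-aware job; the class and method names are made up. Hangfire injects the CancellationToken automatically when it appears in the job method's signature.)

using System;
using System.Threading;
using System.Threading.Tasks;

public class ReportJob
{
    // The injected token is signalled during a graceful shutdown (SIGTERM),
    // not when the process is killed outright with SIGKILL.
    public async Task RunAsync(CancellationToken cancellationToken)
    {
        for (var i = 0; i < 100; i++)
        {
            // Aborting cooperatively lets Hangfire requeue the job instead
            // of leaving it stuck in a stale state.
            cancellationToken.ThrowIfCancellationRequested();
            await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
        }
    }
}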

As for stopping the engine, that is still not really possible without nasty workarounds. It is also not something the storage provider (which is what this repository/library is) can handle. We cannot clear the locks on startup or at any other arbitrary time, because then we might end up with a broken state.

What we did at work is add an endpoint to the applications to check whether any jobs are enqueued/processing, and if so, wait until all of them are complete before continuing with the swarm deployment. Waiting is fine for us; YMMV.
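
(A rough sketch of such a check using Hangfire's monitoring API; the route and the minimal-API style are illustrative, not the exact implementation described above.)

using System.Linq;
using Hangfire;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;

var app = WebApplication.CreateBuilder(args).Build();

// Deployment tooling polls this endpoint and delays the rollout until
// the storage reports no enqueued or processing jobs.
app.MapGet("/hangfire/drain-status", () =>
{
    var monitoring = JobStorage.Current.GetMonitoringApi();
    var enqueued = monitoring.Queues().Sum(q => (long?)q.Length) ?? 0L;
    var processing = monitoring.ProcessingCount();
    return Results.Ok(new { enqueued, processing, idle = enqueued == 0 && processing == 0 });
});

app.Run();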

azygis avatar Nov 04 '24 07:11 azygis