
Who sends "Server rd0003ff2199ca:13116:03c6de91 caught stopping signal", and when?

Open ChangyeWei opened this issue 4 years ago • 9 comments

Hangfire version 1.7.18, .NET Core 3.1, deployed on k8s. Our service has jobs that need to run for 10+ hours, and we get several "remove server" errors while a job is running. I'm not sure whether the switch/stop also happens when only short jobs are running.

I can see the log message "Server *** caught stopping signal". So my questions are: why is the service going to stop, and how can I prevent the server from switching/stopping?


ChangyeWei avatar Jun 23 '21 07:06 ChangyeWei

We started receiving the same stopping signal after a 12-hour run on a couple of servers.

I guess this is because of application pool recycling.

zhuweid avatar Jul 13 '21 00:07 zhuweid

@zhuweid Our service runs on Linux, so application pool recycling shouldn't be an issue. I looked into the code and found that Hangfire removes a server when its heartbeat times out. The default timeout is 5 minutes, and a heartbeat check runs every 30 seconds. I will try to add some logging to confirm that the server is removed because the heartbeat check keeps failing.
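
For reference, a minimal sketch of where those two intervals live, assuming an ASP.NET Core host with the Hangfire.AspNetCore package; the storage call and the connection string name are placeholders, not taken from this thread:

```csharp
// Program.cs: illustrative sketch only, not a drop-in configuration.
using Hangfire;
using Microsoft.Extensions.Logging;

var builder = WebApplication.CreateBuilder(args);

// Surface Hangfire's heartbeat and server-watchdog messages, which are logged
// below the default Information level.
builder.Logging.AddFilter("Hangfire", LogLevel.Debug);

builder.Services.AddHangfire(config => config
    // Placeholder: use whichever storage package your deployment actually uses.
    .UseSqlServerStorage(builder.Configuration.GetConnectionString("HangfireDb")));

builder.Services.AddHangfireServer(options =>
{
    // The defaults discussed above: a heartbeat every 30 seconds, and a server
    // is considered dead once no heartbeat has been recorded for 5 minutes.
    options.HeartbeatInterval = TimeSpan.FromSeconds(30);
    options.ServerTimeout     = TimeSpan.FromMinutes(5);
});

var app = builder.Build();
app.Run();
```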

ChangyeWei avatar Jul 14 '21 05:07 ChangyeWei

Interesting, we observed a similar issue in our testing environment, where a long-running job (12+ hours) failed after a few hours.

After some investigation, we found our SQL DB was rebooted for a SQL patch at the time, which caused SQL to be down for nearly 8 minutes. Since the default heartbeat timeout is 5 minutes, the Hangfire server got removed.

The logging looks like:

...
7/14/2021, 2:25:29.585 AM Server **056q:7732:d21c943a heartbeat successfully sent
7/14/2021, 2:31:18.495 AM 4 servers were removed due to timeout
7/14/2021, 2:31:23.866 AM Server **056q:7732:d21c943a was considered dead by other servers, restarting...
7/14/2021, 2:31:23.867 AM Server **056q:7732:d21c943a caught restart signal...
7/14/2021, 2:31:23.870 AM Server **056q:7732:d21c943a stopped non-gracefully due to ServerWatchdog
7/14/2021, 2:31:23.900 AM Server **056q:7732:d21c943a successfully reported itself as stopped in 11.5192
7/14/2021, 2:31:23.900 AM Server **056q:7732:d21c943a has been stopped in total 15.7445 ms
...

zhuweid avatar Jul 14 '21 06:07 zhuweid

A similar question was also asked here: https://discuss.hangfire.io/t/idle-server-keeps-restarting-considered-dead-by-other-servers/8795

I am running into this "Server was considered dead by other servers, restarting..." issue as we speak.

Can we get more information on this? @odinserj

The Hangfire server that keeps restarting is configured to send a heartbeat every 2 minutes, but it runs long-running jobs (which may take more than 2 minutes).

Where/what is the setting that configures the "considered dead time" for sibling servers?

mhkolk avatar Jun 12 '25 19:06 mhkolk

You can tune BackgroundJobServerOptions.ServerTimeout to a value higher than the default (5 minutes) to handle longer periods of connectivity problems between processing servers and their storage. Unfortunately, you'd need to use the same value for all of your background job servers, because it's not persisted to the storage, and it's not possible to set an individual timeout for a specific server.

A higher value will increase resiliency to network unavailability, but will also increase the time required for background jobs to be re-queued after an unexpected shutdown in some storages. A lower value will decrease the re-queue time, but will also decrease resiliency to network problems. So it's a tradeoff.
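
To make that concrete, here is a rough sketch of raising ServerTimeout; the 15-minute value is only an example, and the storage setup and connection string are placeholders:

```csharp
using System;
using Hangfire;

// Placeholder storage setup: substitute the storage package you actually use,
// e.g. UsePostgreSqlStorage(...) from Hangfire.PostgreSql.
GlobalConfiguration.Configuration
    .UseSqlServerStorage("<connection string>");

var options = new BackgroundJobServerOptions
{
    // Example value only: long enough to ride out roughly an 8-minute storage
    // outage like the SQL patch window described earlier in this thread.
    ServerTimeout     = TimeSpan.FromMinutes(15),
    HeartbeatInterval = TimeSpan.FromSeconds(30), // keep well below ServerTimeout
};

// Every server sharing this storage must use the same ServerTimeout,
// because the value is not persisted to the storage.
using var server = new BackgroundJobServer(options);

Console.WriteLine("Hangfire server started. Press any key to exit...");
Console.ReadKey();
```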

odinserj avatar Jun 13 '25 08:06 odinserj

Thank you for this explanation; however, I was experiencing the "considered dead by other servers" issue well before the 5-minute default you mentioned. I was aware that a default might be in place on the other server, which is why I configured the faulting server to send out heartbeats every 1 or 2 minutes, but it made no difference: the other server still elected to send a restart request.

mhkolk avatar Jun 13 '25 08:06 mhkolk

What version of Hangfire.Core are you using, and what storage package (and its version)?

odinserj avatar Jun 13 '25 09:06 odinserj

That would be 1.8.17, but it is running at schema version 170, using Hangfire.PostgreSql v1.20.10.

mhkolk avatar Jun 13 '25 09:06 mhkolk

The "Server was considered dead by other servers, restarting..." message occurs when a BackgroundServerGoneException is thrown while calling the Heartbeat method of a storage connection. Hangfire.PostgreSql uses the database server's clock as the time authority for heartbeats and for calculating timeouts, so clock synchronization issues between different servers shouldn't cause the problem.

However, it's possible that the clock jumped forward due to a clock synchronization issue on a single server, and this might cause such timeouts. But that should happen fairly rarely. How often do you see this issue?
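
One way to check for such clock jumps is a small diagnostic outside Hangfire itself. This is a hypothetical sketch using Npgsql, with a placeholder connection string, that compares the application clock against the PostgreSQL server's clock:

```csharp
// Hypothetical diagnostic (not part of Hangfire): compare the application
// server's clock with the PostgreSQL server's clock, since Hangfire.PostgreSql
// uses the database clock when evaluating heartbeat timeouts.
using System;
using Npgsql;

// Placeholder connection string: point it at the database Hangfire uses.
const string connectionString = "Host=localhost;Database=hangfire;Username=app;Password=secret";

using var connection = new NpgsqlConnection(connectionString);
connection.Open();

using var command = new NpgsqlCommand("SELECT now() AT TIME ZONE 'utc'", connection);
var dbUtcNow = (DateTime)command.ExecuteScalar()!;
var skew = DateTime.UtcNow - dbUtcNow;

Console.WriteLine($"Application vs. database clock skew: {skew.TotalSeconds:F1} s");
// A skew (or a sudden jump) approaching ServerTimeout, which is 5 minutes by
// default, could explain spurious "considered dead by other servers" restarts.
```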

odinserj avatar Jun 13 '25 09:06 odinserj