pg-boss

Implement job heartbeat

Open kibertoad opened this issue 3 weeks ago • 7 comments

fixes #436 fixes #579

kibertoad avatar Dec 02 '25 20:12 kibertoad

Coverage Status

coverage: 99.913% (+0.005%) from 99.908% when pulling 86a75436ac29a83c5d5d0da84e014b3ab6d2720d on kibertoad:feat/heartbeat into 3da860f0e6f0650dcb95f62e5b71af6dfbeb44f1 on timgit:master.

coveralls avatar Dec 02 '25 21:12 coveralls

First of all, I'm very appreciative of your time and effort on this. I'm still reviewing, but I wanted to add a couple of quick observations for thought and discussion.

This feature is better described as "expiration extension" than "job heartbeat". Only if the job expiration is set low enough to force frequent extensions does it start to resemble a typical health check heartbeat feature. For example, if we call it "heartbeat" and a worker dies one minute after fetching a job with a 1 hour expiration, the failure would not be detected, and the job would not be retried, until an hour later, which doesn't match the name.

Since I would assume most would prefer the practical usefulness of "get health checks for free with a lower expiration", the default expiration of 15 minutes seems far too high to be well-aligned with this feature. If we were to switch this from "opt in" to "opt out" later, we'd have to reduce the default expiration as well (closer to Bull's 30s, for example).

This leads to another potential issue I want to be thoughtful about: each heartbeat results in an update query, which triggers more MVCC activity. My bias is "the fewer queries we send to Postgres, the better". If someone were to set the interval too low, it may trigger more db activity than we'd prefer and actually cause perf issues. So not only should we guard against values low enough to trigger too many queries, I'd also want a maximum value enforced. Currently, there is a configuration check that expiration does not exceed 24 hours, but how much further should we allow this feature to extend beyond this?
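To make the "guard both ends" idea concrete, here is a minimal sketch of a bounds check on the interval. It is not code from the PR: the function name and both limits are hypothetical placeholders, since the actual values are still an open question in this thread.

```typescript
// Hypothetical limits for illustration only; the thread has not settled on values.
const MIN_HEARTBEAT_SECONDS = 10
const MAX_HEARTBEAT_SECONDS = 60

// Returns the value unchanged when it is acceptable, otherwise throws.
function assertHeartbeatSeconds (value: number): number {
  if (!Number.isInteger(value)) {
    throw new TypeError('heartbeat interval must be a whole number of seconds')
  }
  if (value < MIN_HEARTBEAT_SECONDS) {
    // Too low: every heartbeat is an UPDATE, so this would flood Postgres.
    throw new RangeError(`heartbeat interval must be at least ${MIN_HEARTBEAT_SECONDS}s`)
  }
  if (value > MAX_HEARTBEAT_SECONDS) {
    // Too high: a dead worker would go unnoticed for too long.
    throw new RangeError(`heartbeat interval must be at most ${MAX_HEARTBEAT_SECONDS}s`)
  }
  return value
}
```

Rejecting out-of-range values at configuration time, rather than clamping them silently, keeps the query volume predictable and makes a misconfiguration visible immediately.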

timgit avatar Dec 02 '25 23:12 timgit

@timgit Which limits would make sense to you?

I'll do the renaming tomorrow.

kibertoad avatar Dec 02 '25 23:12 kibertoad

@timgit I've updated the naming, let me know what sensible limits would work for you, and I'll apply that change as well.

kibertoad avatar Dec 03 '25 17:12 kibertoad

After more thought, I don't think mutating expiration is the right way to handle this. Since we're still going to require a maximum time, expiration seems like the most natural fit for that. At a minimum, however, it seems we won't be able to avoid adding a heartbeat_on timestamp that the worker is responsible for updating, along with its own monitoring to detect failures and trigger retries. This would restore the original "heartbeat" API and semantics. I was hung up on trying to avoid a schema migration, but it seems like the best option at this point to reduce confusion about what's happening.

Should we allow expiration to be extended? Maybe. I don't have a strong opinion on that yet. If heartbeat becomes a thing, that seems to be the most valuable addition to the package in my opinion.

timgit avatar Dec 04 '25 16:12 timgit

@timgit I'm not sure I fully understand the mechanism you are proposing. Is the idea that the expiration date delimits the maximum possible execution time, at which point the job gets terminated either way, but that it can also be terminated earlier if the worker misses any of the more frequent heartbeats?

So for long running jobs you would set expiration time to 2 hours, and heartbeat to like 10 minutes.

kibertoad avatar Dec 05 '25 00:12 kibertoad

Yes, missing heartbeats would trigger a fail/retry to another worker. And the heartbeat interval would need to be lower, like 30-60s max, not derived from the expiration/max time.
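As a rough sketch of how the two limits would interact (an illustration only, not the PR's implementation): expiration caps total run time, while a missed heartbeat fails the job earlier. The type, field, and function names below are hypothetical; only the heartbeat_on timestamp comes from the proposal above.

```typescript
type JobHealth = 'ok' | 'missed-heartbeat' | 'expired'

// Hypothetical shape of an active job as seen by the monitor.
interface ActiveJob {
  startedOn: Date
  heartbeatOn: Date | null // mirrors the proposed heartbeat_on column
  expireInSeconds: number // hard cap on total execution time
  heartbeatSeconds: number // maximum allowed silence between heartbeats
}

function checkJobHealth (job: ActiveJob, now: Date): JobHealth {
  const elapsedSeconds = (now.getTime() - job.startedOn.getTime()) / 1000
  // Expiration wins: past the maximum time the job is failed regardless of heartbeats.
  if (elapsedSeconds > job.expireInSeconds) return 'expired'

  // Before the first heartbeat, measure silence from the start of the job.
  const lastSeen = job.heartbeatOn ?? job.startedOn
  const silenceSeconds = (now.getTime() - lastSeen.getTime()) / 1000
  if (silenceSeconds > job.heartbeatSeconds) return 'missed-heartbeat'

  return 'ok'
}

// A job with a 2 hour expiration and a 60s heartbeat whose worker dies
// right after its first heartbeat, one minute in.
const job: ActiveJob = {
  startedOn: new Date(0),
  heartbeatOn: new Date(60_000),
  expireInSeconds: 2 * 60 * 60,
  heartbeatSeconds: 60
}

const atNinetySeconds = checkJobHealth(job, new Date(90_000))
const atThreeMinutes = checkJobHealth(job, new Date(180_000))
```

Here the dead worker is caught at the three minute mark instead of after the full two hours. A real monitor would probably also want some tolerance for clock skew and slow queries, which this sketch leaves out.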

timgit avatar Dec 05 '25 01:12 timgit