Support abandoning jobs that take too long to cancel
Context
Hiya, thanks for making River! Really digging it so far.
I have an app that queues jobs I don't really need to wait on before exiting, even after cancelling them. I just want to guarantee fast deploys and let jobs get tossed around like a hot potato during a rolling deploy.
Before I continue: this isn't blocking me, and it's a nuanced topic on an edge case within an edge case, so no rush.
StopAndCancel is almost what I want. Cancelled jobs end up in a retryable state, and they either get picked back up elsewhere or recovered when the instance comes back. The one case it doesn't handle is when jobs take too long to stop (typically by not respecting ctx.Done(), which is likely a bug in the job code).
I tried to pass a timeout ctx to StopAndCancel, but it doesn't currently respect it:
// Give in-flight jobs 10 seconds to wind down after cancellation.
stopCtx, stop := context.WithTimeout(context.Background(), 10*time.Second)
defer stop()

if err := db.workerClient.StopAndCancel(stopCtx); err != nil {
	slog.Error("failed to stop and cancel workers", "error", err)
}

// If the deadline hit, some jobs never stopped and are left behind.
if stopCtx.Err() != nil {
	slog.Warn("timed out cleaning up jobs")
}
With a job that does time.Sleep(time.Minute), StopAndCancel just waits the full minute instead of being interrupted after 10 seconds. This seems like a simple bug; happy to submit a PR.
Even if it did respect the timeout, there's a secondary problem: if we bail out at that point, the worker's jobs are left in a running state. The job is now stuck, and if you have uniqueness rules like I do, that'll prevent future attempts too. As far as I can tell, the only recourse at this point is to delete the job from the database or manually set its state back to retryable.
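For reference, the manual fix I'm describing looks roughly like the sketch below (using github.com/jackc/pgx/v5/pgxpool). It assumes River stores jobs in a river_job table with a state column, which is worth double-checking against the actual schema, and markJobRetryable is just a name I made up.

// Manual recovery sketch: flip a single stuck job back to retryable.
// Assumes River's river_job table and state column; verify against your schema.
func markJobRetryable(ctx context.Context, pool *pgxpool.Pool, jobID int64) error {
	_, err := pool.Exec(ctx,
		`UPDATE river_job SET state = 'retryable' WHERE id = $1 AND state = 'running'`,
		jobID,
	)
	return err
}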
Proposal
This might be problematic for other use cases, but at least for mine, this would be a big help:
1. StopAndCancel respects ctx cancellation
2. When StopAndCancel times out, it marks all running jobs on the local worker as retryable
Alternatively, having some way to do (2) independently would be fine. For example, I could call something like db.workerClient.Abandon(stopCtx) in the stopCtx.Err() != nil branch above (sketched below).
This would allow graceful shutdown in most cases, prevent hanging in the "truly stuck" case, and allow jobs to be retried in all cases except kill -9.
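Concretely, the shutdown path would look something like this. Abandon is the hypothetical method being proposed here and doesn't exist in River today, and the 5-second grace window is arbitrary.

// Hypothetical shutdown flow; Abandon is the proposed method, not a real River API.
stopCtx, stop := context.WithTimeout(context.Background(), 10*time.Second)
defer stop()

if err := db.workerClient.StopAndCancel(stopCtx); err != nil {
	slog.Error("failed to stop and cancel workers", "error", err)
}

if stopCtx.Err() != nil {
	// Some jobs never stopped; mark them retryable before exiting. A fresh
	// context is required because stopCtx has already expired.
	abandonCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := db.workerClient.Abandon(abandonCtx); err != nil {
		slog.Warn("failed to abandon running jobs", "error", err)
	}
}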
Alternative
The root of the problem here is really the stuck running jobs. I would also be OK with leaving everything as-is and instead just enforcing a stop timeout at the infra layer, letting the app get kill -9ed, and having River automatically detect when a running job is abandoned so things aren't stuck forever.
Thanks for reporting this @vito! I've fixed your issue (1) above as part of #79. This is something we used to have, but it broke during a refactor in the push toward release.
The second part is a bit trickier, although I think it's conceptually possible. Internally the Client knows which jobs haven't yet completed, so it's conceivable that we could grab that list and make a final SQL update prior to exiting. It would be challenging to make sure that this is all done safely and that there are no adverse effects or panics if a job happens to finish while we've already given up on it and are marking it as retryable. It's also a matter of when we make that update: do we do so prior to StopAndCancel returning? If its context is already expired, what context do we use for the SQL update given that all existing contexts have already been cancelled?
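To make the idea concrete, a final pass might look something like the sketch below. This is not River's internals, just an illustration: it uses a fresh short-lived context (since everything else is already cancelled), assumes the river_job table and state column, and only touches rows still marked running in case a job finished in the meantime.

// Illustration only, not River internals: after giving up on the local client's
// in-flight jobs, move whatever is still marked running back to retryable.
func abandonLocalJobs(pool *pgxpool.Pool, jobIDs []int64) error {
	// Fresh context: the shutdown context (and every job context) is already cancelled.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// The state = 'running' guard avoids clobbering jobs that finished at the last moment.
	_, err := pool.Exec(ctx,
		`UPDATE river_job SET state = 'retryable' WHERE id = ANY($1) AND state = 'running'`,
		jobIDs,
	)
	return err
}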
Given that we can't actually halt a running goroutine in Go, I do think this is probably the right design for us in the end, if we can find a reasonable way to make it work cleanly.
Meanwhile, your best bet is a combination of StopAndCancel with a context timeout and the RescueStuckJobsAfter setting, tuned to a duration after which you're comfortable considering your jobs "stuck" so that they're returned to retryable. This isn't ideal, because jobs that sometimes intentionally run long will still get "rescued" after that duration and retried again.
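For anyone landing here later, wiring that up looks roughly like the sketch below. RescueStuckJobsAfter is the setting mentioned above; the queue/worker setup and the 15-minute value are just illustrative, and the surrounding shapes follow River's documented client API as I understand it.

import (
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
	"github.com/riverqueue/river"
	"github.com/riverqueue/river/riverdriver/riverpgxv5"
)

// Sketch: a client whose stuck `running` jobs are rescued back to retryable
// after 15 minutes, paired with a context-bounded StopAndCancel at shutdown.
func newWorkerClient(dbPool *pgxpool.Pool, workers *river.Workers) (*river.Client[pgx.Tx], error) {
	return river.NewClient(riverpgxv5.New(dbPool), &river.Config{
		Queues: map[string]river.QueueConfig{
			river.QueueDefault: {MaxWorkers: 10},
		},
		Workers: workers,
		// Tune this to comfortably exceed your longest legitimate job so that
		// intentionally long-running work isn't rescued and retried prematurely.
		RescueStuckJobsAfter: 15 * time.Minute,
	})
}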
@bgentry Oh nice, didn't know about RescueStuckJobsAfter! That combined with #79 works for me, so feel free to close. Considering the complexity you mentioned, a global config that solves the general problem is even better.
Nice. #79 + RescueStuckJobsAfter should go a long way towards helping here.
I'm a little afraid that uncancellable jobs (and, by extension, jobs that don't die well after a client stops and only get rescued an hour later) will end up being a major Achilles' heel for people, just given how easy it is to write an uncancellable job in Go. Having the client do one final pass that basically runs the rescuer on all jobs that were cancelled but didn't stop, so their state is back in good order for the next time a client starts up, might not be the worst idea.
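As an illustration of how easy the footgun is to hit (the args and worker names below are made up, but the shape follows River's documented worker interface): the first Work ignores ctx entirely, so StopAndCancel can only wait it out, while the second returns as soon as cancellation propagates.

type SleepArgs struct{}

func (SleepArgs) Kind() string { return "sleep" }

// Uncancellable: time.Sleep never observes ctx, so cancellation can't interrupt it.
type SleepWorker struct {
	river.WorkerDefaults[SleepArgs]
}

func (w *SleepWorker) Work(ctx context.Context, job *river.Job[SleepArgs]) error {
	time.Sleep(time.Minute)
	return nil
}

// Cancellable equivalent: returns promptly when ctx is cancelled by StopAndCancel.
type PoliteSleepWorker struct {
	river.WorkerDefaults[SleepArgs]
}

func (w *PoliteSleepWorker) Work(ctx context.Context, job *river.Job[SleepArgs]) error {
	select {
	case <-time.After(time.Minute):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}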
The first part of this is resolved as per #79 and will be included in the upcoming release. It sounds like @brandur is eager to also see if we can find a way to cleanly handle stuck jobs on shutdown, so I will leave this issue open to track that additional work. Thanks again for the detailed issue @vito 🙏
It sounds like @brandur is eager to also see if we can find a way to cleanly handle stuck jobs on shutdown,
Yep. On my feature list!
@vito We just cut a new release, 0.0.11, in which stopping respects context cancellation.
@brandur Awesome, thanks for following up! 🙏