
Need to retry jobs that died with ProcessPrunedError

Open · nhorton opened this issue 5 months ago • 5 comments

We have some situations where we get ProcessPrunedErrors for jobs that were on killed workers (in Kubernetes). We need a way to have those retried.

on_thread_error seems like it might be appropriate, but it is very unclear what the lifecycle would look like using it since the examples are all just for error captures.
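
For context, the only usage I've been able to find is plain error reporting, something along these lines (a sketch; it's not obvious where a retry or re-enqueue would fit into this callback's lifecycle):

```ruby
# config/environments/production.rb
# Error-reporting style usage of on_thread_error; the callback only receives
# the exception, so there is no obvious handle here for re-enqueueing the job
# that was running on the thread.
config.solid_queue.on_thread_error = ->(exception) do
  Rails.error.report(exception, handled: false)
end
```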

nhorton · Jun 26 '25 20:06

Oh, yes... on_thread_error is intended for Solid Queue's internal errors and wouldn't fit for this.

In theory, you should only get process pruned errors when a process dies unceremoniously (killed, like you said). On some occasions, there's something going on in the job itself that causes the process to be killed (e.g. a memory leak in the job that causes the OS to kill the process), so you generally don't want to retry these.

If your workers are getting killed frequently, could you address this instead of automatically retrying these jobs? Is there a way to signal proper termination to the workers instead of relying on the pod going away or something similar?

rosa · Jun 26 '25 21:06

@rosa - sorry for the slow reply.

We are 100% trying to fix the problem itself. That said, it is non-trivial to do so. We are in a containerized environment, and that pretty much always means processes get OOMKilled if they exceed their memory limits. It is also REALLY hard to tell which Active Job instance is causing the issue, because the whole pod gets killed and it is hard to even know which jobs were running on it at the time.

(as an aside) In a perfect world, the supervisor would dump some rich log info when it saw a worker process die, like details of all the jobs that were in flight. That would help with fixing this.
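
In the meantime we approximate that ourselves with a plain Active Job callback, so the pod logs at least show which jobs were started but never finished when a worker gets OOMKilled. A rough sketch (the log format is just ours):

```ruby
# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
  # Log start/finish with the pid so that, when a pod dies, the "started"
  # lines without a matching "finished" line tell us which jobs were in
  # flight. Nothing Solid Queue-specific here.
  around_perform do |job, block|
    Rails.logger.info("[jobs] started #{job.class.name} #{job.job_id} pid=#{Process.pid} args=#{job.arguments.inspect}")
    block.call
    Rails.logger.info("[jobs] finished #{job.class.name} #{job.job_id} pid=#{Process.pid}")
  end
end
```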

But regardless of everything else, I would assert that the inability to do this is a massive flaw in SolidQueue. The general semantic that queues provide is executing each job at least once, and it is essentially failing to do that. As it stands, critical tasks like running billing or such could just not happen because of an infrastructure failure or such - the exact thing people use queues to avoid.

I would suggest that there are two things you could do that would make sense, and at least one really should happen:

  1. Automatically re-enqueue any ProcessPruned jobs.
  2. Add a hook for ProcessPruned jobs, so that when the process that notices the dead jobs marks them as ProcessPruned in the DB, it instead (or in addition) calls the hook and we can re-enqueue them ourselves (a rough sketch of what I mean follows this list).
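
To make that concrete, the sketch below is roughly the sweep I would be happy to run myself. I'm assuming the failed execution's error payload can be matched on the exception class name and that SolidQueue::FailedExecution exposes the same retry operation Mission Control's button uses; happy to adjust to whatever shape you'd prefer:

```ruby
# Hypothetical sweeper; the class name, queue, and query are illustrative,
# not an existing Solid Queue API.
class RetryPrunedJobsJob < ApplicationJob
  queue_as :maintenance

  def perform
    # Assumes the serialized error column includes the exception class name.
    SolidQueue::FailedExecution
      .where("error LIKE ?", "%ProcessPrunedError%")
      .find_each do |failed_execution|
        Rails.logger.info("[jobs] retrying pruned job #{failed_execution.job_id}")
        failed_execution.retry # assumed to be the same operation as Mission Control's retry button
      end
  end
end
```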

nhorton · Jul 18 '25 16:07

> like details of all the jobs that were in flight

But these jobs are precisely the ones that fail with ProcessPrunedError.

> The general semantic that queues provide is executing each job at least once, and it is essentially failing to do that.

How so? It is running the job at least once, but something external is killing the process.

> As it stands, critical tasks like running billing or such could just not happen because of an infrastructure failure or such - the exact thing people use queues to avoid.

This is not true. If your infrastructure is down, queues like this one, which are persisted in the DB, let you recover the jobs later when your infrastructure recovers. Queues can't work when your infrastructure is down. I also don't quite agree with that being the reason people use background jobs. I think the main reason is simply to run tasks in the background that are too long to run within a request.

rosa · Jul 18 '25 17:07

> > like details of all the jobs that were in flight
>
> But these jobs are precisely the ones that fail with ProcessPrunedError.

That is fair - it is just difficult to reconstruct the details of what happened from the failed jobs.

> > The general semantic that queues provide is executing each job at least once, and it is essentially failing to do that.
>
> How so? It is running the job at least once, but something external is killing the process.

It is only guaranteed to have dequeued the job at least once. Many of these jobs could have barely started.

I do want to back off my statement a little though; a better phrasing is that the strong recommendation in many places to make jobs idempotent so that they can be rerun is largely there to make "re-enqueue by default" safe. Sidekiq definitely re-enqueues and I think GoodJob does too, though Resque I think does not. Kubernetes jobs do it. I think Celery does too.

> > As it stands, critical tasks like running billing or such could just not happen because of an infrastructure failure or such - the exact thing people use queues to avoid.
>
> This is not true. If your infrastructure is down, queues like this one, which are persisted in the DB, let you recover the jobs later when your infrastructure recovers. Queues can't work when your infrastructure is down.

My infrastructure is not down - a single node went down. This is a common occurrence in a cloud world - one assumes nodes die regularly.

But step back from the flow suggested here and look at it from a higher level:

  1. A SolidQueue node is up and running - that is what is noticing the orphaned jobs and moving them to another queue. So by definition the queueing system is up at this point.
  2. There is a node that is doing the work of updating the state of the job in the DB already.
  3. I am being forced to have alerting for jobs with this status so that humans can go look at them and hit the re-enqueue button in MissionControl.

Even if you don't want to re-enqueue these, why not give a hook where engineers can automate the behavior in step 2 instead of routing everything through step 3? I am happy to do the re-enqueue on my own for this.
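
To be explicit about the hook shape I mean, here is a sketch. This is purely a proposal; nothing like on_process_pruned exists in Solid Queue today. When the pruner fails the claimed executions of a dead process, it would also invoke a configurable callback with those executions and let the application decide what to do:

```ruby
# Proposed API sketch only; on_process_pruned is NOT an existing option.
# The pruner would call it with the executions it just marked as failed.
Rails.application.config.solid_queue.on_process_pruned = ->(failed_executions) do
  failed_executions.each do |failed_execution|
    # Our choice would be to re-enqueue immediately; others might alert instead.
    failed_execution.retry
  end
end
```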

> I also don't quite agree with that being the reason people use background jobs. I think the main reason is simply to run tasks in the background that are too long to run within a request.

If they are really long running, SolidQueue and similar are not great for that because they do not play well with process lifecycle concerns like deploys. We have a lot of multi-hour tasks and run those with ActiveJob Kubernetes for that reason.

Beyond that, I agree that it is used that way largely in the Rails ecosystem, but Rails is also a little odd that way. Queueing in general is really for either when you have variable loads and need to normalize the load across time (an operations management idea that predates CS), or when you want a different execution semantic, like resilience. Part of where I think Rails has lost ground to other languages like Python is on this point - e.g. it is so easy to write code for FastAPI where many of the requests take minutes to complete, but you just don't care, because stacking 500 simultaneous requests in a process is fine when they are all waiting on I/O. I.e. everything involving LLM requests or heavy data processing. Don't get me wrong - I love Rails - but its norms do make the long-running stuff harder.

But that is all a digression - I just want to be able to hook on the event of these lost jobs being detected. Would you be open to a PR in this area?

nhorton · Jul 20 '25 05:07

"re-enqueue by default" safe. Sidekiq definitely re-enqueues and I thing Good job does too, though Resque I think does not. Kubernetes jobs do it. I think Celery does too.

I'm not sure Sidekiq re-enqueues jobs when it's hard-killed, I think the Pro version allows you to recover jobs of this kind in a special way described here, with a special handling of jobs that were recovered three times to avoid poison pills. Solid Queue's behaviour was precisely intended to avoid poison pills.

> If they are really long running, SolidQueue and similar are not great for that because they do not play well with process lifecycle concerns like deploys. We have a lot of multi-hour tasks and run those with ActiveJob Kubernetes for that reason.

That's not a problem for Solid Queue. In-flight jobs during deploys are re-enqueued automatically, as long as you're sending a TERM signal and configuring your desired shutdown timeout. We also have a lot of long-running jobs, many multi-hour (we use Active Job continuations to resume those if needed), deploy many times every day, and jobs failed with ProcessPrunedError have been very rare.
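
For reference, the relevant configuration is roughly this (a sketch; double-check the option name against the version you're running), plus making sure your pods' termination grace period is at least as long as the timeout so the workers actually get that time:

```ruby
# config/environments/production.rb (or an initializer)
# Time the supervisor waits after sending TERM before forcing workers to stop;
# within that window, in-flight jobs can finish or be re-enqueued as described above.
config.solid_queue.shutdown_timeout = 30.seconds
```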

> But that is all a digression - I just want to be able to hook on the event of these lost jobs being detected. Would you be open to a PR in this area?

It'd need to be done in a way that avoids poison pills (as that'd be much worse than having to retry jobs manually) and that doesn't introduce a lot of complexity because it'd handle a case that should never happen in the first place.

rosa · Jul 20 '25 06:07