"inprogress" jobs aren't actually executing after a worker terminates abnormally

Open Ed1lan opened this issue 8 months ago • 6 comments

Hi! I am using SolidQueue in a project and recently ran into a problem: the mission_control-jobs gem shows some jobs as running, but when I check the system processes, they do not exist.

Investigating further, I discovered a few things. I don't know why, but the workers shut down and restarted automatically, then tried to reclaim the jobs that were running at that moment. This failed with the following logs:

Image

I have read the documentation and seen that it covers the case in which “someone pulls the cable”. I confirmed that, as described there, the jobs that were running at that time are in the SolidQueue::ClaimedExecution table. The current status of each of those jobs is “inprogress”, but the associated SolidQueue::Process does not exist.

I ran a small snippet to show the status of the jobs claimed by processes that no longer exist:

Image

Image

Image
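The snippet itself only survives in the screenshots, so as a rough illustration of this kind of check, here is a minimal plain-Ruby sketch. It uses in-memory stand-ins rather than the real models: `ClaimedRow`, `ProcessRow`, and `orphaned_claims` are hypothetical names, and the `job_id`/`process_id` fields are an assumption about what the SolidQueue::ClaimedExecution rows carry.

```ruby
# Plain-Ruby stand-ins for the two tables named in the thread. In a Rails
# console the rows would come from SolidQueue::ClaimedExecution and
# SolidQueue::Process instead; the field names here are assumptions.
ClaimedRow = Struct.new(:job_id, :process_id)
ProcessRow = Struct.new(:id, :name)

# Returns the claimed rows whose process_id matches no registered process.
def orphaned_claims(claimed_rows, process_rows)
  live_ids = process_rows.map(&:id)
  claimed_rows.reject { |row| live_ids.include?(row.process_id) }
end

claimed = [
  ClaimedRow.new(101, 1),
  ClaimedRow.new(102, 7), # process 7 is no longer registered
  ClaimedRow.new(103, 7)
]
processes = [ProcessRow.new(1, "worker-a")]

orphans = orphaned_claims(claimed, processes)
puts orphans.map(&:job_id).inspect
```

With the sample data, jobs 102 and 103 are reported, since nothing with id 7 is registered as a process any more.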

To see how it would behave, I set the “finished_at” field of one of the jobs. That caused it to be marked internally as :finished, but it still appears in mission_control's list of inprogress jobs and is still in the SolidQueue::ClaimedExecution table.

Image

Additionally, if I inspect that job specifically, it now shows up as :finished.

Image

How can I release these jobs correctly? How can I prevent this from happening? Am I missing something? For the moment, I plan to manually mark those jobs as finished by setting the finished_at date and removing them from the SolidQueue::ClaimedExecution table, but I would like to know how to prevent this from happening, or whether this is a bug.

P.S. Sorry if my English is poor.

Ed1lan avatar Mar 25 '25 15:03 Ed1lan

Hey @Ed1lan, sorry about that! Those jobs should be released automatically (marked as failed) when the supervisor starts next time. Is this happening in development only? Was your computer going to sleep or something when all those workers died?

rosa avatar Mar 25 '25 15:03 rosa

Hey @rosa, thank you for your fast reply! This happened in the production environment, and at that moment a backup snapshot of the server was being taken. Maybe the backup shut down the workers? But after that they didn't get marked as failed as they were supposed to.

Ed1lan avatar Mar 25 '25 16:03 Ed1lan

Maybe the backup shut down the workers? But after that they didn't get marked as failed as they were supposed to.

Hmm no, that shouldn't be related 🤔 They could have crashed or something, but that shouldn't happen 🤔

I just noticed, in your first screenshot above, that the code that should have marked your in-progress jobs as failed did run. These are the lines that say:

Fail claimed jobs (..) job_ids: ... 

and a list of job IDs. You should see similar lines for the jobs in your other screenshots, the ones in progress for which the process doesn't exist. That didn't happen?

rosa avatar Mar 25 '25 19:03 rosa

No, it didn't. Those jobs stayed in-progress and claimed, as shown in the other screenshots.

Ed1lan avatar Mar 26 '25 07:03 Ed1lan

No, it didn't. Those jobs stayed in-progress and claimed, as shown in the other screenshots.

And they didn't get released when you restarted the supervisor? Releasing in-progress jobs happens automatically at start, and from that log line above, I know it's happening correctly in your case.

rosa avatar Mar 31 '25 07:03 rosa

It usually does release the jobs, but it didn't this time, and those jobs stayed claimed for two days until I released them manually. Restarting solid_queue or the server itself didn't help.

Ed1lan avatar Apr 01 '25 11:04 Ed1lan

Hi everyone, we're experiencing the same kind of issue on our side as well.

If it helps, I’d be happy to share some context or logs to help narrow it down.

Have you had any updates or progress?

AlessandroTolomio avatar May 06 '25 13:05 AlessandroTolomio

Hey @AlessandroTolomio, sorry for the delay. We added these lines to the service that runs solid_queue

ExecReload=/bin/kill -TSTP $MAINPID
ExecStop=/bin/kill -TERM $MAINPID

and to date we haven't experienced the issue again. Hope it helps you or someone else in the future!
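For anyone wondering where those two lines go, here is a minimal sketch of a systemd unit around them. Only the ExecReload and ExecStop lines come from this thread; the description, user, working directory, start command, and timeout are assumptions for illustration and need adapting to the actual deployment.

```ini
[Unit]
# Hypothetical description; not taken from the thread.
Description=Solid Queue supervisor for myapp
After=network.target

[Service]
Type=simple
# Assumed user and app location.
User=deploy
WorkingDirectory=/var/www/myapp/current
# Assumed start command; use whatever your service already runs.
ExecStart=/usr/bin/env bundle exec rake solid_queue:start

# The two lines from this thread, copied as posted.
ExecReload=/bin/kill -TSTP $MAINPID
ExecStop=/bin/kill -TERM $MAINPID

# Assumed value: how long systemd waits after ExecStop before sending SIGKILL.
TimeoutStopSec=60
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

The idea behind an explicit ExecStop with TERM is that the main process receives a normal termination signal and has time to exit before systemd escalates to SIGKILL; TimeoutStopSec controls that window.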

Ed1lan avatar May 21 '25 08:05 Ed1lan

Going to close this one.

rosa avatar Jun 16 '25 20:06 rosa