
"Queued" jobs not processed after worker connects

Open olalonde opened this issue 9 years ago • 12 comments

[Screenshot from 2016-08-03: Kue dashboard showing jobs stuck in the queued state]

I've got a few jobs stuck in the "queued" state. The worker doesn't seem to process them unless I click the refresh icon. I wonder why this is and whether it's possibly a bug in Kue.

olalonde avatar Aug 03 '16 21:08 olalonde

I had this with a backlog of 1000+ jobs; I ended up writing a script to clear them all out and restarted the server.

I'm not sure what caused it, but a number of jobs were created simultaneously (in error) and each of them is long-running.

Here's some sample code to start cleaning things up in Redis (please disregard the sp0n modules, they're from a private repo):

'use strict';

const config = require('sp0n-config'); // private config module, see note above
const kue    = require('kue');
const url    = require('url');
const redis  = require('redis');

// Build a redis client from the configured URL
// (to clean up locally, point this at process.env.LOCAL_REDIS_URL instead).
const getRedis = () => {
  const redisUrl = url.parse(config.Redis.RedisUrl);
  const client   = redis.createClient(redisUrl.port, redisUrl.hostname);
  if (redisUrl.auth) {
    client.auth(redisUrl.auth.split(':')[1]);
  }
  return client;
};

var queue = kue.createQueue({
  redis: { 
    createClientFactory: getRedis
  }
});

queue.inactive((err, ids) => { // other state getters: active, complete, failed, delayed
  if (err) throw err;
  const sAction = 'queue.inactive';
  console.log({ action: sAction, ids: ids });
  for (const id of ids) {
    kue.Job.remove(id, (err) => {
      if (err) {
        console.error({ action: sAction + '.err', id: id, err: err });
      }
    });
  }
});

// another way to go
// kue.Job.rangeByState('inactive', 0, 10000, 'asc', (err, jobs) => {
//   for (const job of jobs) {
//     kue.Job.remove(job.id, (err) => {
//       if (err) {
//         console.error({ action: 'rangeByState.err', id: job.id, err: err });
//       }
//     });
//   }
// });

victusfate avatar Aug 19 '16 16:08 victusfate

Thanks for the script. I added queue.watchStuckJobs(5 * 1000) and the problem hasn't re-occurred so far.
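
For reference, here's a minimal sketch of wiring that in at queue setup (the 5-second polling interval is the one I mentioned above):

    'use strict';
    const kue = require('kue');
    const queue = kue.createQueue();

    // Poll Redis every 5s for jobs whose state index got out of sync
    // and re-enqueue them (watchStuckJobs is documented in Kue's README).
    queue.watchStuckJobs(5 * 1000);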

olalonde avatar Aug 19 '16 23:08 olalonde

We ran into it again. It appears to occur whenever a large number of jobs are queued in a short time interval. I'll try to build a test case locally and determine where it's getting stuck.

@behrad mentioned some changes in the next release that may have an effect, so if I can reproduce it I'll test the latest release and see whether that resolves the stuck-in-queued issue.

victusfate avatar Aug 25 '16 19:08 victusfate

Yeah, reproducing the bug would be helpful. Additionally, something super useful would be a CLI, as opposed to the web dashboard or the redis REPL: kue list, kue retry failed, kue clear failed, etc. I might work on this when I have time.
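
Something like this could be a starting point. Just a sketch: the kue binary and subcommand names are made up, and it leans on Job.rangeByState plus the job.state('inactive').save() re-queue idiom:

    #!/usr/bin/env node
    // Hypothetical CLI sketch: `node kue-cli.js list failed`,
    // `node kue-cli.js retry failed`, `node kue-cli.js clear failed`.
    'use strict';
    const kue = require('kue');
    const queue = kue.createQueue(); // also initializes the redis connection
    const [cmd, state = 'failed'] = process.argv.slice(2);

    kue.Job.rangeByState(state, 0, 10000, 'asc', (err, jobs) => {
      if (err) throw err;
      let pending = jobs.length;
      if (!pending) process.exit(0);
      const finish = () => { if (--pending === 0) process.exit(0); };
      jobs.forEach((job) => {
        if (cmd === 'list') {
          console.log(job.id, job.type, job._state); // _state is kue's internal field
          finish();
        } else if (cmd === 'retry') {
          // flip the job back to inactive so a worker re-runs it
          job.state('inactive').save(finish);
        } else if (cmd === 'clear') {
          job.remove(finish);
        }
      });
    });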

olalonde avatar Aug 25 '16 19:08 olalonde

I didn't even think of the redis REPL, cool.

victusfate avatar Aug 25 '16 19:08 victusfate

Reviewing the tips for preventing stuck active jobs: https://github.com/Automattic/kue#prevent-from-stuck-active-jobs

This could definitely be the cause, since I'm doing media processing and all kinds of interesting errors can arise. I'll go with the domain wrapper or the promise setup (all the rest of the code I'm using is promise-based).

Hmm, I'd already been doing something like this:

    // Wrap the promise-returning worker so done() is always called,
    // on both resolution and rejection.
    queue.process(this.type, this.concurrency, (job, done) => {
      this.fWorker(job, job.data)
        .then(() => done())
        .catch((err) => done(err));
    });

I've read some comments that domains are deprecated: https://nodejs.org/api/domain.html

victusfate avatar Aug 30 '16 17:08 victusfate

Some good stuff is mentioned in this thread as well (similar issue): https://github.com/Automattic/kue/issues/130

I'm trying something in the workers now to shut down gracefully. I was reliably getting the queue stuck by queueing jobs and killing the worker, so it never had a chance to call done. The same can happen if a worker crashes (hence the domain wrapper, or maybe a try/catch).
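
The core of what I'm trying is Kue's shutdown hook wired to the process signals, roughly this sketch:

    'use strict';
    const kue = require('kue');
    const queue = kue.createQueue();

    // On SIGTERM/SIGINT, stop pulling new jobs and give active ones
    // up to 5s to finish; jobs still active after that are failed
    // with a shutdown error (Queue#shutdown([timeout,] fn)).
    const graceful = (signal) => {
      queue.shutdown(5000, (err) => {
        console.log('kue shutdown:', err || 'ok', signal);
        process.exit(0);
      });
    };
    process.once('SIGTERM', () => graceful('SIGTERM'));
    process.once('SIGINT', () => graceful('SIGINT'));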

victusfate avatar Sep 08 '16 15:09 victusfate

Ahoy @olalonde: this was an earlier batch job that I could run and kill to consistently get stuck jobs. With some modifications it no longer gets stuck, but I'm seeing some active jobs just hanging out in limbo. I commented on this in issue #130.

OK, I put together a gist with graceful queue and worker shutdown. I'm still seeing a stuck active job, so I think pausing the worker does not move active jobs back into an inactive state.

Here's the gist: https://gist.github.com/victusfate/1e2ce9eb73de32b78d2690d660f0f9c8

victusfate avatar Sep 08 '16 16:09 victusfate

I updated the gist to handle setting active jobs to inactive.

OK, I believe my latest version of that gist works as expected: it pauses the worker and makes any incomplete jobs inactive so that other workers, or future workers, can pick them up.
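
The requeue step in the gist boils down to roughly this (a sketch, assuming the job.state('inactive').save() idiom; 10000 is just a generous page size):

    'use strict';
    const kue = require('kue');
    const queue = kue.createQueue(); // initializes the redis connection

    // Move any still-active jobs back to inactive so another (or a
    // future) worker can pick them up after this one pauses.
    kue.Job.rangeByState('active', 0, 10000, 'asc', (err, jobs) => {
      if (err) throw err;
      jobs.forEach((job) => {
        job.state('inactive').save((err) => {
          if (err) console.error('requeue failed', job.id, err);
        });
      });
    });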

victusfate avatar Sep 08 '16 17:09 victusfate

@victusfate good job 👍 What happens if the process signal handlers are not called? Do the inactive jobs get unstuck eventually?

olalonde avatar Sep 09 '16 20:09 olalonde

Yeah, I didn't handle uncaught exceptions, and there could be other signals I missed, but it worked very well while I killed and restarted it locally for testing. No stuck queue; with the earlier version I could reliably recreate a stuck queue just by killing the workers and rerunning them.

So the sample code above provides some level of battle hardening, but it isn't break-proof. Still, it resolved all the stuck-queue issues I've seen in our dev/prod environments.
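
One more hedge that would cover the crash case (a sketch; exiting after an uncaughtException is still the safe move):

    'use strict';
    const kue = require('kue');
    const queue = kue.createQueue();

    // Route crashes through the same shutdown path, then exit
    // non-zero so a supervisor can restart the worker.
    process.once('uncaughtException', (err) => {
      console.error('uncaught exception, shutting down kue', err);
      queue.shutdown(5000, () => process.exit(1));
    });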

victusfate avatar Sep 15 '16 14:09 victusfate

> I had this for a backlog of 1000+ jobs, I ended up writing a script to clear them all out and restarted the server. […]

Does this mean jobs that fail are removed and never run?

aiavci avatar Jan 01 '20 16:01 aiavci