
[Bug]: removeOnFail with Age Not Working

Open stevenolay opened this issue 1 year ago • 5 comments

Version

4.10.0

Platform

NodeJS

What happened?

removeOnFail works as intended when supplied a boolean. However, when supplied an age, the failed job stays in the queue indefinitely.

How to reproduce.

import { Queue, QueueEvents, Worker } from 'bullmq';
import { assert } from 'console';

// utils

async function sleep(ms: number) {
    return new Promise((resolve) => {
        setTimeout(resolve, ms);
    });
}

const DEFAULT_JOB_NAME = '__default__';

const REMOVE_ON_FAIL_IMMEDIATE = 'remove on fail immediate';

const RemoveOnFailImmediateQueue = new Queue(REMOVE_ON_FAIL_IMMEDIATE, {
    connection: { host: 'localhost' },
    defaultJobOptions: {
        removeOnComplete: true,
        removeOnFail: true,
        attempts: 1,
    }
});

const RemoveOnFailImmediateQueueEvents = new QueueEvents(REMOVE_ON_FAIL_IMMEDIATE, { connection: { host: 'localhost' } });

RemoveOnFailImmediateQueueEvents.on('completed', ({ jobId }) => {
    console.log('done testing remove immediate');
});

RemoveOnFailImmediateQueueEvents.on(
    'failed',
    ({ jobId }: { jobId: string; }) => {
        console.log('Error in RemoveOnFailImmediateQueue');
    },
);


const RemoveOnFailImmediateWorker = new Worker(REMOVE_ON_FAIL_IMMEDIATE, async job => {
    throw new Error('TEST ERROR');
}, { connection: { host: 'localhost' } });

async function testRemoveOnFailImmediate() {
    const jobId = 'removeImmediate'
    await RemoveOnFailImmediateQueue.add(DEFAULT_JOB_NAME, {}, { jobId })
    await sleep(10); // Sleep not needed. Here for posterity. 
    const job = await RemoveOnFailImmediateQueue.getJob(jobId);

    // Manast I know you hate null ;)
    const jobIsUndefined = !job;
    assert(jobIsUndefined)
}

const REMOVE_ON_FAIL_WITH_AGE = 'remove on fail with age';

const RemoveOnFailWithAgeQueue = new Queue(REMOVE_ON_FAIL_WITH_AGE, {
    connection: { host: 'localhost' },
    defaultJobOptions: {
        removeOnComplete: true,
        removeOnFail: { age: 2 },
        attempts: 1,
    }
});

const RemoveOnFailWithAgeQueueEvents = new QueueEvents(REMOVE_ON_FAIL_WITH_AGE, { connection: { host: 'localhost' } });

RemoveOnFailWithAgeQueueEvents.on('completed', ({ jobId }) => {
    console.log('done testing remove with age');
});

RemoveOnFailWithAgeQueueEvents.on(
    'failed',
    ({ jobId }: { jobId: string; }) => {
        console.log('Error in RemoveOnFailWithAgeQueue');
    },
);


const RemoveOnFailWithAgeWorker = new Worker(REMOVE_ON_FAIL_WITH_AGE, async job => {
    throw new Error('TEST ERROR');
}, { connection: { host: 'localhost' } });

async function testRemoveOnFailWithAge() {
    const jobId = 'removeWithAge'
    await RemoveOnFailWithAgeQueue.add(DEFAULT_JOB_NAME, {}, { jobId })

    const waitMs = 5 * 1000; // 5 seconds, longer than the 2-second age above
    await sleep(waitMs);

    const job = await RemoveOnFailWithAgeQueue.getJob(jobId);

    const jobIsUndefined = !job;
    assert(jobIsUndefined, 'testRemoveOnFailWithAge: Job Still Exists In Queue.');

    if (job) {
        const failed = await job.isFailed();
        assert(!failed, 'testRemoveOnFailWithAge: Job is marked as failed.');
    }
}


testRemoveOnFailImmediate()
    .then(testRemoveOnFailWithAge)
    .then(() => sleep(1000)) // Sleep added to catch logs prior to exit
    .finally(() => process.exit(0))
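
The expectation behind this reproduction is that a job which failed more than `age` seconds ago should no longer be in the queue. A minimal sketch of that arithmetic, assuming `age` is expressed in seconds and timestamps in milliseconds. `isOlderThanAge` is a hypothetical helper written for illustration and is not part of BullMQ:

```typescript
// Hypothetical helper, NOT a BullMQ API: models only the age comparison.
// Assumption: `ageSeconds` is in seconds, timestamps are in milliseconds.
function isOlderThanAge(
    finishedOnMs: number,
    nowMs: number,
    ageSeconds: number,
): boolean {
    return nowMs - finishedOnMs > ageSeconds * 1000;
}

// Values taken from the reproduction: age of 2 seconds, check after a 5-second sleep.
const failedAtMs = 0;
const checkedAtMs = 5 * 1000;
const ageSeconds = 2;

// By age alone, the failed job is past its retention window at check time.
const expectedRemoved = isOlderThanAge(failedAtMs, checkedAtMs, ageSeconds);
console.log(expectedRemoved); // prints true
```

This only shows that the job is old enough to qualify for removal. It does not model when BullMQ actually performs the cleanup. The BullMQ guide on auto-removal describes it as lazy, running when a later job completes or fails, so a queue that only ever sees this one job may be worth checking against that behavior.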

Relevant log output

Logs from running this code:

Error in RemoveOnFailImmediateQueue
Error in RemoveOnFailWithAgeQueue
Assertion failed: testRemoveOnFailWithAge: Job Still Exists In Queue.
Assertion failed: testRemoveOnFailWithAge: Job is marked as failed.



stevenolay, Sep 14 '23 13:09

Not sure what the difference is, but we have a test that covers precisely the scenario you are presenting in your issue: https://github.com/taskforcesh/bullmq/blob/master/tests/test_worker.ts#L467

manast, Sep 14 '23 14:09

Yeah. @manast

I created a blank npm project with just this test case in it to verify I wasn't going crazy. The job is definitely stuck in the queue: when I open the Redis CLI and run HGETALL on the job's key, it's chilling there, failed, and the removeOnFail setting is correct.

It's not getting removed for some reason.

I am running NodeJS 14, and my local Redis version is 7.

stevenolay, Sep 14 '23 21:09

Your code is too long and complex. Why don't you start from working code, such as the one in the test case, and build from there? You will surely find the problem then.

manast, Sep 22 '23 13:09

@manast

This is a totally fake test case I made just for simple reproduction.

I just wanted to see if I could reproduce it. My actual code is written in NestJS using the BullQueueMQ package. I was observing failed jobs stuck in the queue in production and was trying to understand why. I am running the latest version of everything Bull-related.

I couldn't reproduce the problem when running your test case locally. I spent a few hours trying to understand the difference and gave up for now.

stevenolay, Sep 22 '23 13:09

Ok, let me know when you can produce a test based on the working tests from BullMQ and we can take a deeper look into it.

manast, Sep 22 '23 13:09