[Bug]: WorkerHost's old code being used after a code change
Version
4.18.2
Platform
NodeJS
What happened?
We had custom handling inside a worker's process function that wrote a key into a different Redis DB than the one used for Bull jobs. The handling failed with the Redis error `Connection is closed.` To fix the problem we removed this locking logic entirely so jobs would be processed successfully, but the job handling continued to throw the same error, and the error stack still pointed to a code line that no longer existed. As a result, it triggered the failed event of the event listener
@OnQueueEvent('failed')
We tried removing the job and adding it back, but that did not help; it seemed that the processor's old code was kept somewhere in Bull. We restarted the process so the worker would be recreated, but that did not help either. The problem is that once a job had been processed with an error, the processor kept throwing the same error even after the code was changed and redeployed. We also rebooted the Redis instance and cleaned up the BullMQ jobs DB, but newly created jobs still failed with the connection-is-closed error. In the end, the problem was fixed by creating a brand new Redis instance. I tried to replicate the behaviour in my local environment, and after fixing the code the processing recovered as expected, but during the incident the problem persisted.
- Can you help identify whether BullMQ has a mechanism for caching processor code, and if so, how to overcome it?
- How are failed jobs handled under the hood if no attempts are specified?
How to reproduce.
The problem is not cleanly reproducible, as it seemed to be specific to the Redis instance, yet cleaning that instance up did not resolve it. So here is the part of the code that failed.
- Have a simple NestJS WorkerHost that throws an error while processing
- The failed event will be triggered
- Remove the code that throws the error; in some cases in production, the processor still threw the previous code's error.

```typescript
import { Processor, WorkerHost } from '@nestjs/bullmq';
import { Job } from 'bullmq';

import { RedisService } from '../../redis/redis.service';

@Processor('SCHEDULER_QUEUE')
export class ConsumerBullScheduled extends WorkerHost {
  constructor(readonly redisService: RedisService) {
    super();
  }

  async process(job: Job<{ identifier: string }>) {
    const locked = await this.redisService.setIfNotExists(
      'id',
      'value',
      2 * 60,
    ); // or something else that throws an error
    // processing continues here
  }
}
```
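For context, a minimal sketch of what a setIfNotExists lock helper typically looks like with ioredis (this is an illustrative assumption inferred from the method name; the actual RedisService implementation is not shown in this issue):

```typescript
import Redis from 'ioredis';

// Hypothetical sketch of the lock helper, inferred from its name;
// the real implementation is not part of this issue.
export class RedisService {
  constructor(private readonly redis: Redis = new Redis()) {}

  // SET key value EX ttl NX: succeeds only if the key does not exist.
  // If the underlying connection is closed, ioredis rejects with
  // "Connection is closed." -- the error seen in the log output below.
  async setIfNotExists(
    key: string,
    value: string,
    ttlSeconds: number,
  ): Promise<boolean> {
    const result = await this.redis.set(key, value, 'EX', ttlSeconds, 'NX');
    return result === 'OK';
  }
}
```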
Relevant log output
```
[onFail] Job repeat:7a1b5bc209f6c2c602617430ed6a0ad5:1744534912175 failed with error Connection is closed.
```
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
I understand your frustration, but BullMQ does not keep old code anywhere. Probably you have some old worker that is still running and you are not aware of it. Calling Queue.getWorkers, or using a dashboard, may shed some light on it.
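A minimal diagnostic sketch of such a check (queue name and connection options are placeholders; getWorkers returns the client entries Redis reports via CLIENT LIST):

```typescript
import { Queue } from 'bullmq';

// List every worker currently attached to the queue. A stale
// deployment that still holds a connection would show up here.
async function listWorkers() {
  const queue = new Queue('SCHEDULER_QUEUE', {
    connection: { host: 'localhost', port: 6379 },
  });

  const workers = await queue.getWorkers();
  for (const worker of workers) {
    // Each entry is a parsed CLIENT LIST record (id, addr, name, ...).
    console.log(worker.name, worker.addr);
  }

  await queue.close();
}

listWorkers().catch(console.error);
```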
One more question: if we had two different versions of the process running, could the @OnQueueEvent('failed') event be fired in the process that did not produce the failed job?
If a job fails and has a backoff strategy with attempts, will the new attempts use the new processor code? If removeOnFail is 1h, will the job fire new failed events during that time even if it has no attempts configured?
> One more question: if we had two different versions of the process running, could the @OnQueueEvent('failed') event be fired in the process that did not produce the failed job?
You should use the @OnWorkerEvent decorator if you want to listen only to the events generated by that worker: https://docs.nestjs.com/techniques/queues#event-listeners
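A minimal sketch of the pattern (handler names and logging are illustrative): @OnWorkerEvent('failed') fires only for jobs that failed in this worker instance, whereas @OnQueueEvent('failed') in a queue-events listener receives queue-wide events produced by any worker.

```typescript
import { OnWorkerEvent, Processor, WorkerHost } from '@nestjs/bullmq';
import { Job } from 'bullmq';

@Processor('SCHEDULER_QUEUE')
export class ConsumerBullScheduled extends WorkerHost {
  async process(job: Job<{ identifier: string }>) {
    // processing logic here
  }

  // Fires only for failures produced by this worker instance,
  // not for failures reported by other processes on the same queue.
  @OnWorkerEvent('failed')
  onFailed(job: Job | undefined, error: Error) {
    console.error(`[onFail] Job ${job?.id} failed locally:`, error.message);
  }
}
```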
> If a job fails and has a backoff strategy with attempts, will the new attempts use the new processor code? If removeOnFail is 1h, will the job fire new failed events during that time even if it has no attempts configured?
This is not really BullMQ related; this is how Node works: if you deploy new code, then the new code runs. Obviously, if you have a job that is delayed by 1 hour and you do a new deployment during that time, the new code will run when it is time to process the job.
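To make the retry semantics concrete, a hedged sketch of the options in question (values are examples): with attempts and a backoff strategy, a failed job is re-queued and each retry runs whatever processor code is deployed when it executes; removeOnFail only controls how long the failed job record is retained and does not re-emit 'failed' events on its own.

```typescript
import { Queue } from 'bullmq';

const queue = new Queue('SCHEDULER_QUEUE', {
  connection: { host: 'localhost', port: 6379 },
});

await queue.add(
  'scheduled-task',
  { identifier: 'example' },
  {
    // Retry up to 3 times; each attempt runs the code deployed
    // at the moment the attempt is processed.
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
    // Keep the failed job record for 1 hour (age is in seconds);
    // this is retention only, it does not trigger new events.
    removeOnFail: { age: 3600 },
  },
);
```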
I fully understand the logic of new code being deployed and therefore used on retry, but the production logs show that even after removing the faulty worker, the on-queue failed events still kept showing up in the new deployment with the same connection-is-closed error. That is what made us create a new Redis instance, which solved the issue, and there is no logic that can explain it. Could the faulty worker's event backlog have been so large that, after killing it, the other workers still received those events?