[Bug]: moveToFailed throws an exception when using Elasticache serverless
Version
5.12.0
Platform
NodeJS
What happened?
We are using an Elasticache Serverless instance (Redis v7.1). When a job is added to a queue and then fails, an exception is thrown:
ReplyError: EXECABORT Transaction discarded because of previous errors.
at parseError (bull-monitor/node_modules/redis-parser/lib/parser.js:179:12)
at parseType (bull-monitor/node_modules/redis-parser/lib/parser.js:302:14) {
command: { name: 'exec', args: [] },
previousErrors: [
ReplyError: ERR command not supported inside transaction
at parseError (bull-monitor/node_modules/redis-parser/lib/parser.js:179:12)
at parseType (bull-monitor/node_modules/redis-parser/lib/parser.js:302:14) {
command: [Object]
},
ReplyError: ERR command not supported inside transaction
at parseError (bull-monitor/node_modules/redis-parser/lib/parser.js:179:12)
at parseType (bull-monitor/node_modules/redis-parser/lib/parser.js:302:14) {
command: [Object]
}
]
}
How to reproduce.
Replace some-serverless-host with a relevant Redis instance.
import {Worker, Queue, UnrecoverableError} from 'bullmq';
import Redis from 'ioredis';
const clusterQueue = new Queue('test-queue', {
  prefix: '{bullMQ}',
  connection: new Redis.Cluster([
    {host: 'some-serverless-host', port: 6379},
  ], {
    dnsLookup: (address, callback) => callback(null, address),
    redisOptions: {
      tls: true,
    }
  })
})
export async function renderQueue() {
  await clusterQueue.add('name:some-name', 'some-job-data')
}
const WorkerQueue = new Worker('test-queue', async (job) => {
  throw new UnrecoverableError('test cluster exception')
}, {
  connection: new Redis.Cluster([
    {host: 'some-serverless-host', port: 6379},
  ], {
    dnsLookup: (address, callback) => callback(null, address),
    redisOptions: {
      tls: true,
    }
  }),
  prefix: '{bullMQ}'
})
WorkerQueue.on('waiting', () => console.log('waiting completed'))
WorkerQueue.on('completed', () => console.log('jobs completed'))
WorkerQueue.on('failed', () => console.log('failed completed'))
Hey, from what I can see in the stack trace, it points to bull-monitor and redis-parser internals.
Another comment: you must not use job names that include : as we will throw an error.
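As a rough illustration of the rule mentioned in this comment, a guard like the following could run before a job is added. The helper name `assertJobNameHasNoColon` is hypothetical and not part of BullMQ; it only mirrors the constraint as stated here.

```javascript
// Hypothetical helper, not a BullMQ API: rejects names containing ":"
// (the constraint as described in the comment above).
function assertJobNameHasNoColon(name) {
  if (typeof name !== "string" || name.includes(":")) {
    throw new Error(`Invalid job name "${name}": names must not contain ":"`);
  }
  return name;
}

// A name without ":" is returned unchanged.
const safeName = assertJobNameHasNoColon("some-name");

// The name used in the repro above is rejected by this guard.
let rejected = false;
try {
  assertJobNameHasNoColon("name:some-name");
} catch (err) {
  rejected = true;
}
```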
Hi @roggervalf, basically we are using BullMQ with Elasticache Serverless, and for some reason we get that error every time a task fails and BullMQ tries to move it to failed. Any idea why? (With regular Elasticache it works as expected.)
ChatGPT told me the following: "Based on the detailed information you’ve provided, the error you’re encountering stems from using AWS ElastiCache Redis Serverless, which has certain limitations compared to standard Redis installations. Specifically, it does not support some commands that BullMQ relies on, such as EVAL and EVALSHA, especially within transactions. This incompatibility leads to the ERR command not supported inside transaction error when BullMQ tries to execute these commands."
So it seems that Elasticache Serverless does not support calling Lua scripts within a transaction, which is something moveToFailed does. Although not used extensively, there are other places where we use EVALSHA in multi/exec transactions, such as when adding jobs in bulk. The only way to support AWS Elasticache Serverless would be to convert these transactions to pure Lua scripts, which is doable but probably a couple of days of work. Maybe AWS also plans to support this themselves?
I will keep this open as an enhancement as moving to pure Lua scripts is a long term goal anyway, as it is more robust than using multi/exec from a transactional perspective (as you get better atomic guarantees).
@manast thanks for the info!
+1, serverless Redis is becoming the first-choice option in AWS nowadays. It looks like it will soon be serverless Valkey due to Redis licensing.
+1. We use AWS Elasticache Serverless and recently got an error complaining about EVALSHA, which breaks the job lock functionality. As a result, any job running for more than 30s is put back in the queue and executed twice.
@manast Your previous comment suggested EVALSHA is not supported by serverless, however I found it in the docs. Is there any other reason this command might not work at all?
@bowenzhou222 EVAL and EVALSHA work; what does not work is calling these commands within a multi/exec Redis transaction. However, I cannot find where this is stated, nor where it could be reported so that they could implement it in the future. For now I am trying to eliminate the use of multi + eval in the most-used code paths of BullMQ, but some features will not work because they are too complicated to fix, such as flows and adding jobs in bulk.
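To make the distinction concrete, here is a minimal sketch of a preflight check that flags script calls queued inside a transaction. The function `findScriptCallsInTransaction` and the tuple format for queued commands are assumptions made for illustration; they are not part of BullMQ or ioredis, and the key names are placeholders.

```javascript
// Commands this thread reports as rejected inside MULTI on Elasticache Serverless.
const SCRIPT_COMMANDS = new Set(["eval", "evalsha"]);

/**
 * Hypothetical preflight check: returns the queued commands that would be
 * rejected with "ERR command not supported inside transaction".
 * `queued` is an array of [commandName, ...args] tuples.
 */
function findScriptCallsInTransaction(queued) {
  return queued.filter(([name]) =>
    SCRIPT_COMMANDS.has(String(name).toLowerCase())
  );
}

// Example transaction: one script call plus one plain command.
const queued = [
  ["evalsha", "<sha1>", 2, "{q}:q:1:lock", "{q}:q:stalled"],
  ["set", "{q}:q:meta", "1"],
];

// Only the script call is flagged; the plain SET is fine inside MULTI.
const offending = findScriptCallsInTransaction(queued);
console.log(offending.map(([name]) => name));
```

The same commands sent on their own, outside MULTI/EXEC, are the case described above as working.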
The PR that was just merged should resolve the issue with failed jobs and lock extension. However, some features will not work yet, such as flows and add bulk, which use multi as well; unfortunately they are too complex to solve the way we did for moveToFailed. I think it would be good if you contacted AWS customer support and asked them about this missing feature; it may be something they could easily support if they realise it is important for some users.
Hi @manast,
we are on the latest version of bullmq-pro (7.26.1). From what I can gather, this fix should be included in this version, however we are still seeing issues with locks after moving our Redis to serverless on AWS. Any idea?
{
"stack": "ReplyError: EXECABORT Transaction discarded because of previous errors.\n at parseError (/var/app/current/node_modules/redis-parser/lib/parser.js:179:12)\n at parseType (/var/app/current/node_modules/redis-parser/lib/parser.js:302:14)",
"message": "EXECABORT Transaction discarded because of previous errors.",
"command": {
"name": "exec",
"args": []
},
"previousErrors": [
{
"stack": "ReplyError: ERR command not supported inside transaction\n at parseError (/var/app/current/node_modules/redis-parser/lib/parser.js:179:12)\n at parseType (/var/app/current/node_modules/redis-parser/lib/parser.js:302:14)",
"message": "ERR command not supported inside transaction",
"command": {
"name": "eval",
"args": [
"--[[\n Extend lock and removes the job from the stalled set.\n Input:\n KEYS[1] 'lock',\n KEYS[2] 'stalled'\n ARGV[1] token\n ARGV[2] lock duration in milliseconds\n ARGV[3] jobid\n Output:\n \"1\" if lock extented succesfully.\n]]\nlocal rcall = redis.call\nif rcall(\"GET\", KEYS[1]) == ARGV[1] then\n -- if rcall(\"SET\", KEYS[1], ARGV[1], \"PX\", ARGV[2], \"XX\") then\n if rcall(\"SET\", KEYS[1], ARGV[1], \"PX\", ARGV[2]) then\n rcall(\"SREM\", KEYS[2], ARGV[3])\n return 1\n end\nend\nreturn 0\n",
"2",
"{action}:action:71840:lock",
"{action}:action:stalled",
"ba1a1f4b-06e0-4150-87dc-4c942832c51a:1",
"30000",
"71840"
]
}
}
]
}
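For readers unfamiliar with the script embedded in the error above, the following is an in-memory sketch of its logic, written from the Lua source shown in the `args` field. A `Map` and a `Set` stand in for the lock key and the stalled set; the function name `extendLock`, its parameter shape, and the token value are assumptions for illustration, and this is not how BullMQ actually runs the script.

```javascript
// In-memory model of the "extend lock" script logic. `store` stands in for
// Redis string keys with a TTL, `stalled` for the stalled set.
function extendLock(store, stalled, { lockKey, token, durationMs, jobId }, now = Date.now()) {
  const lock = store.get(lockKey);
  // GET KEYS[1] == ARGV[1]: only the current lock owner may extend.
  // An expired lock reads as missing, as it would in Redis.
  if (!lock || lock.expiresAt <= now || lock.token !== token) {
    return 0;
  }
  // SET KEYS[1] ARGV[1] PX ARGV[2]: rewrite the lock with a fresh TTL.
  store.set(lockKey, { token, expiresAt: now + durationMs });
  // SREM KEYS[2] ARGV[3]: the job is no longer a stall candidate.
  stalled.delete(jobId);
  return 1;
}

const lockKey = "{action}:action:71840:lock";
const store = new Map([[lockKey, { token: "worker-token", expiresAt: 10000 }]]);
const stalled = new Set(["71840"]);

// The owner extends the lock at t=5000 for another 30000 ms.
const extended = extendLock(
  store, stalled,
  { lockKey, token: "worker-token", durationMs: 30000, jobId: "71840" },
  5000
);

// A different token cannot extend the same lock.
const denied = extendLock(
  store, stalled,
  { lockKey, token: "other-token", durationMs: 30000, jobId: "71840" },
  6000
);
```

If this call is rejected by the server, the lock simply expires after its duration (30000 ms in the trace), which is consistent with the double execution reported earlier in the thread.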
@tobiasviehweger yes, I am not sure if this is the same because there are other places where we combine MULTI with EVAL, such as in batches, flows, schedulers, so depending on what you are using you can trigger this error.
@madolson I am pinging you in case you are not following this issue yet :)
Ah.. I see we are using schedulers.. will try to change them to be executed from somewhere else.. I'll report back if we detect other issues as well..
Hi @manast
we have now removed schedulers but still are getting this with rather normal queues from time to time.. is there anything related to stalled item processing that would trigger this as well? I'm not too deep into the retry logic, unfortunately...
{
"stack": "ReplyError: ERR command not supported inside transaction\n at parseError (/var/app/current/node_modules/redis-parser/lib/parser.js:179:12)\n at parseType (/var/app/current/node_modules/redis-parser/lib/parser.js:302:14)",
"message": "ERR command not supported inside transaction",
"command": {
"name": "eval",
"args": [
"--[[\n Extend lock and removes the job from the stalled set.\n Input:\n KEYS[1] 'lock',\n KEYS[2] 'stalled'\n ARGV[1] token\n ARGV[2] lock duration in milliseconds\n ARGV[3] jobid\n Output:\n \"1\" if lock extented succesfully.\n]]\nlocal rcall = redis.call\nif rcall(\"GET\", KEYS[1]) == ARGV[1] then\n -- if rcall(\"SET\", KEYS[1], ARGV[1], \"PX\", ARGV[2], \"XX\") then\n if rcall(\"SET\", KEYS[1], ARGV[1], \"PX\", ARGV[2]) then\n rcall(\"SREM\", KEYS[2], ARGV[3])\n return 1\n end\nend\nreturn 0\n",
"2",
"{distribute-atlassian-webhooks}:distribute-atlassian-webhooks:comment_created_jira:64d16cea-7558-464a-8b65-b09169f1e1d3_1090151_1737540568674:lock",
"{distribute-atlassian-webhooks}:distribute-atlassian-webhooks:stalled",
"b1f4545b-6821-453b-8863-08b4005d1a38:38916",
"30000",
"comment_created_jira:64d16cea-7558-464a-8b65-b09169f1e1d3_1090151_1737540568674"
]
}
}
@manast Yeah, we are looking into it on our end. We manually marked a bunch of commands as blocked for multi that cause issues with consistency, specifically EVALSHA is problematic for us, and some other reasons like head of line blocking. We didn't find any issue yet with EVAL though, so it might have just been an oversight for us to block it. We have someone looking into fixing it.
@madolson just to clarify, the command that BullMQ would use in MULTI is indeed EVALSHA, as it would be too slow to send the Lua script in every call.
Ok, that should still be possible but will require some more effort on our side. We'll keep you posted though.
Hi @manast sorry to ping you again here - we are still seeing errors, and we are not using any special functions anymore. My guess is this is coming from the extendLock method in the JobPro class, as it is still using multi - do you see any chance in removing this from the JobPro class? Thanks and have a good weekend!
//Edit.. ah this is only in the batched case... odd..
//Edit2: I think it is possibly the extendLock method in the WorkerPro class; it also uses multi in non-batched cases
@manast You think this would work? https://gist.github.com/tobiasviehweger/85a57a6a099a40f44368ef4d9ac1dcaa
@tobiasviehweger I will look into this asap or @roggervalf if you are faster than me :)
Should be fixed from version 7.26.5.
Hi @manast :)
We're getting this error (ReplyError: EXECABORT Transaction discarded because of previous errors) on bullmq version 5.41.3 when calling upsertJobScheduler.
We're using Elasticache Serverless with Valkey 8.0.
Unfortunately it is not really possible for us to solve this issue. Hopefully the Valkey team can solve it; otherwise there are many non-serverless alternatives (or even serverless ones, for example Upstash, which also works well with BullMQ).
@Omers-Frontegg This is something we will need to fix on our (AWS) side. Can you validate it's the same error that manast mentioned above, with the ERR command not supported inside transaction?
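One way to check for this in code is to inspect the caught error, as in the sketch below. It assumes the error has the shape shown in the traces in this thread (a `message` string and, for EXECABORT, a `previousErrors` array); the helper name `isUnsupportedInTransaction` is hypothetical.

```javascript
const UNSUPPORTED = "ERR command not supported inside transaction";

// Returns true when the error is, or wraps, the "not supported inside
// transaction" reply seen in the traces in this thread.
function isUnsupportedInTransaction(err) {
  if (!err || typeof err.message !== "string") {
    return false;
  }
  // The inner error may be surfaced directly.
  if (err.message.includes(UNSUPPORTED)) {
    return true;
  }
  // Otherwise look inside an EXECABORT wrapper.
  if (!err.message.startsWith("EXECABORT")) {
    return false;
  }
  const previous = Array.isArray(err.previousErrors) ? err.previousErrors : [];
  return previous.some(
    (e) => e && typeof e.message === "string" && e.message.includes(UNSUPPORTED)
  );
}

// Sample objects shaped like the logged errors.
const wrapped = {
  message: "EXECABORT Transaction discarded because of previous errors.",
  previousErrors: [{ message: UNSUPPORTED }],
};
const unrelated = { message: "ECONNRESET" };

console.log(isUnsupportedInTransaction(wrapped), isUnsupportedInTransaction(unrelated));
```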
@madolson Yep, here is the caught error.
@madolson Do you have a rough estimation when we could get a fix?
@manast Maybe you could suggest an alternative for the meantime?
Hi @Omers-Frontegg, we are also working on removing multi usage when upserting a job scheduler. It will still take 1 or 2 weeks of work. An alternative would be to use v5.39.2.
@Omers-Frontegg Sorry, I do not. Supporting EVAL is possible but EVALSHA is turning out to be much more difficult. It seems like maybe the problem will get solved by the change roggervalf is mentioning.
@Omers-Frontegg @madolson even if we succeed in removing multi/eval from upsertJobScheduler (which is not trivial to do either, by the way), we will not be able to remove all the multi/evals we have in the rest of the codebase. So in general I cannot recommend using Elasticache Serverless with BullMQ for the time being, as we cannot guarantee that it will be stable enough. Until we can run the whole test suite against serverless, you should not use BullMQ in production with this Redis alternative. You could, on the other hand, use the standard non-serverless Elasticache, or Upstash as a serverless solution that does support multi/eval.
Good to know, we should support it.