[Bug]: moveToFailed throws an exception when using Elasticache serverless

Open amit-opus opened this issue 1 year ago • 3 comments

Version

5.12.0

Platform

NodeJS

What happened?

We are using an Elasticache Serverless instance (Redis v7.1). When a job is added to the queue and then fails, an exception is thrown:

ReplyError: EXECABORT Transaction discarded because of previous errors.
    at parseError (bull-monitor/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (bull-monitor/node_modules/redis-parser/lib/parser.js:302:14) {
  command: { name: 'exec', args: [] },
  previousErrors: [
    ReplyError: ERR command not supported inside transaction
        at parseError (bull-monitor/node_modules/redis-parser/lib/parser.js:179:12)
        at parseType (bull-monitor/node_modules/redis-parser/lib/parser.js:302:14) {
      command: [Object]
    },
    ReplyError: ERR command not supported inside transaction
        at parseError (bull-monitor/node_modules/redis-parser/lib/parser.js:179:12)
        at parseType (bull-monitor/node_modules/redis-parser/lib/parser.js:302:14) {
      command: [Object]
    }
  ]
}
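
The error above has a recognizable shape: an EXECABORT from exec whose previousErrors entries carry the "ERR command not supported inside transaction" message. A minimal sketch of a check for that shape follows. ReplyErrorLike and isUnsupportedInTransaction are hypothetical names, and the field names are read off the printed trace rather than taken from a published type definition.

```typescript
// Shape of the error as printed in the trace above; the field names come from
// that output, not from a library's type definitions.
interface ReplyErrorLike {
  message: string;
  previousErrors?: ReplyErrorLike[];
}

const UNSUPPORTED_IN_TX = 'ERR command not supported inside transaction';

// Hypothetical helper: true when an EXECABORT was caused by a command that the
// server refused to queue inside MULTI/EXEC.
function isUnsupportedInTransaction(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false;
  const e = err as Partial<ReplyErrorLike>;
  if (typeof e.message !== 'string' || !e.message.startsWith('EXECABORT')) {
    return false;
  }
  const previous = Array.isArray(e.previousErrors) ? e.previousErrors : [];
  return previous.some(
    (p) =>
      typeof p === 'object' &&
      p !== null &&
      typeof p.message === 'string' &&
      p.message.startsWith(UNSUPPORTED_IN_TX),
  );
}
```

Such a check could be called from an error handler to tell this incompatibility apart from other transaction failures; where to call it is left open here.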

How to reproduce.

Replace some-serverless-host with the host of a relevant Redis instance.

import {Worker, Queue, UnrecoverableError} from 'bullmq';
import Redis from 'ioredis';

const clusterQueue = new Queue('test-queue', {
    prefix: '{bullMQ}',
    connection: new Redis.Cluster([
        {host: 'some-serverless-host', port: 6379},
    ], {
        dnsLookup: (address, callback) => callback(null, address),
        redisOptions: {
            tls: true,
        }
    })
})

export async function renderQueue() {
    await clusterQueue.add('name:some-name', 'some-job-data')
}

const WorkerQueue = new Worker('test-queue', async (job) => {
    throw new UnrecoverableError('test cluster exception')
}, {
    connection: new Redis.Cluster([
        {host: 'some-serverless-host', port: 6379},
    ], {
        dnsLookup: (address, callback) => callback(null, address),
        redisOptions: {
            tls: true,
        }
    }),
    prefix: '{bullMQ}'
})

WorkerQueue.on('waiting', () => console.log('waiting completed'))
WorkerQueue.on('completed', () => console.log('jobs completed'))
WorkerQueue.on('failed', () => console.log('failed completed'))

amit-opus avatar Sep 26 '24 14:09 amit-opus

Hey, from what I can see in the stack trace, it points to bull-monitor and redis-parser internals.

roggervalf avatar Sep 30 '24 06:09 roggervalf

Another comment is that you must not use job names that include : as we will throw an error.

roggervalf avatar Sep 30 '24 06:09 roggervalf

Hi @roggervalf, basically we are using BullMQ with Elasticache Serverless, and for some reason we get that error every time a task fails and BullMQ tries to move it to failed. Any idea why? (With regular Elasticache it works as expected.)

or-opus avatar Oct 01 '24 13:10 or-opus

ChatGPT told me the following: "Based on the detailed information you’ve provided, the error you’re encountering stems from using AWS ElastiCache Redis Serverless, which has certain limitations compared to standard Redis installations. Specifically, it does not support some commands that BullMQ relies on, such as EVAL and EVALSHA, especially within transactions. This incompatibility leads to the ERR command not supported inside transaction error when BullMQ tries to execute these commands."

So it seems that Elasticache Serverless does not support calling Lua scripts within a transaction, which is something that moveToFailed does. Although it is not used extensively, there are other places where we use EVALSHA in MULTI/EXEC transactions, such as when adding jobs in bulk. The only way to support AWS Elasticache Serverless would be to convert these transactions into pure Lua scripts, which is doable but probably a couple of days of work. Maybe AWS also plans to support this themselves?

manast avatar Nov 16 '24 12:11 manast

I will keep this open as an enhancement as moving to pure Lua scripts is a long term goal anyway, as it is more robust than using multi/exec from a transactional perspective (as you get better atomic guarantees).

manast avatar Nov 16 '24 12:11 manast
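
To make the "pure Lua script" idea in the comment above concrete, here is a rough sketch that folds the per-job extend-lock logic quoted later in this thread into a single script, so no MULTI/EXEC wrapper is needed. This is not BullMQ's actual script: EXTEND_LOCKS_LUA, LockedJob, and buildExtendLocksArgs are hypothetical names, and the key layout ({prefix}:queue:jobId:lock and {prefix}:queue:stalled) is assumed from the keys visible in the logs below.

```typescript
// Sketch only, NOT BullMQ's script: extends several locks in one atomic Lua call.
const EXTEND_LOCKS_LUA = `
-- KEYS[1]      stalled set
-- KEYS[2..n+1] one lock key per job
-- ARGV[1]      lock duration in milliseconds
-- ARGV[2..]    (token, jobId) pairs, in the same order as the lock keys
local rcall = redis.call
local failed = {}
for i = 2, #KEYS do
  local token = ARGV[2 * (i - 1)]
  local jobId = ARGV[2 * (i - 1) + 1]
  if rcall("GET", KEYS[i]) == token then
    rcall("SET", KEYS[i], token, "PX", ARGV[1])
    rcall("SREM", KEYS[1], jobId)
  else
    table.insert(failed, jobId)
  end
end
return failed
`;

interface LockedJob {
  id: string;
  token: string;
}

// Hypothetical helper: lays out KEYS followed by ARGV in the order the script expects.
function buildExtendLocksArgs(
  prefix: string,
  queue: string,
  jobs: LockedJob[],
  durationMs: number,
): { numKeys: number; args: string[] } {
  const base = `${prefix}:${queue}`;
  const keys: string[] = [`${base}:stalled`];
  const argv: string[] = [String(durationMs)];
  for (const job of jobs) {
    keys.push(`${base}:${job.id}:lock`);
    argv.push(job.token, job.id);
  }
  return { numKeys: keys.length, args: [...keys, ...argv] };
}
```

With ioredis, the result could be passed along as redis.eval(EXTEND_LOCKS_LUA, numKeys, ...args). Because every key shares the same hash tag prefix, they should land in one cluster slot, which a multi-key script requires.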

@manast thanks for the info!

or-opus avatar Nov 18 '24 08:11 or-opus

+1, serverless Redis is becoming the first-choice option in AWS nowadays. It looks like it will soon be serverless Valkey due to Redis licensing.

mariuszbeltowski avatar Nov 29 '24 09:11 mariuszbeltowski

+1. We use AWS Elasticache Serverless and recently got an error complaining about EVALSHA, which breaks the job lock functionality. As a result, any job running for more than 30s is put back in the queue and executed twice.

@manast Your previous comment suggested EVALSHA is not supported by serverless, but I found it in the docs. Is there any other reason this command might not work at all?

bowenzhou222 avatar Nov 29 '24 11:11 bowenzhou222

@bowenzhou222 EVAL and EVALSHA work; what does not work is calling these commands within a MULTI/EXEC Redis transaction. However, I cannot find where this is stated, nor where it could be reported so that they could implement it in the future. For now I am trying to eliminate the use of MULTI + EVAL in the most used code paths of BullMQ, but some features will not work because they are too complicated to fix, such as flows and adding jobs in bulk.

manast avatar Dec 08 '24 16:12 manast
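
One way to read "eliminate the use of MULTI + EVAL" is to send each script as its own command instead of queuing several inside a transaction. The sketch below illustrates that trade-off only and is not a description of how BullMQ does it. EvalFn, ScriptCall, and runScriptsIndividually are hypothetical names, and the call shape simply mirrors EVAL's (script, numKeys, ...args) order.

```typescript
// Minimal call shape assumed for this sketch; not tied to a specific client library.
type EvalFn = (script: string, numKeys: number, ...args: string[]) => Promise<unknown>;

interface ScriptCall {
  script: string;
  keys: string[];
  argv: string[];
}

type ScriptResult =
  | { call: ScriptCall; ok: true; value: unknown }
  | { call: ScriptCall; ok: false; error: unknown };

// Hypothetical helper: runs each script as a separate command. Each script is
// still atomic on its own, but the group is not: some calls may succeed while
// others fail, which MULTI/EXEC would otherwise have prevented.
async function runScriptsIndividually(
  evalFn: EvalFn,
  calls: ScriptCall[],
): Promise<ScriptResult[]> {
  return Promise.all(
    calls.map(async (call): Promise<ScriptResult> => {
      try {
        const value = await evalFn(call.script, call.keys.length, ...call.keys, ...call.argv);
        return { call, ok: true, value };
      } catch (error) {
        return { call, ok: false, error };
      }
    }),
  );
}
```

The caller has to decide what a partial failure means, which is part of why features that depend on all-or-nothing behavior are harder to migrate this way.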

The PR that was just merged should resolve the issue with failed jobs and lock extension. However, some features will not work yet, such as flows and add bulk, which use MULTI as well; unfortunately they are too complex to solve the way we did for moveToFailed. I think it would be good if you contacted AWS customer support and asked them about this missing feature. It may be something they could easily support if they realise it is important for some users.

manast avatar Dec 09 '24 09:12 manast

Hi @manast,

we are on the latest version of bullmq-pro (7.26.1) - from what I can gather this fix should be included in this version, however we are still seeing issues with locks after upgrading our Redis to serverless on AWS - any idea?

            "stack": "ReplyError: EXECABORT Transaction discarded because of previous errors.\n    at parseError (/var/app/current/node_modules/redis-parser/lib/parser.js:179:12)\n    at parseType (/var/app/current/node_modules/redis-parser/lib/parser.js:302:14)",
            "message": "EXECABORT Transaction discarded because of previous errors.",
            "command": {
                "name": "exec",
                "args": []
            },
            "previousErrors": [
                {
                    "stack": "ReplyError: ERR command not supported inside transaction\n    at parseError (/var/app/current/node_modules/redis-parser/lib/parser.js:179:12)\n    at parseType (/var/app/current/node_modules/redis-parser/lib/parser.js:302:14)",
                    "message": "ERR command not supported inside transaction",
                    "command": {
                        "name": "eval",
                        "args": [
                            "--[[\n  Extend lock and removes the job from the stalled set.\n  Input:\n    KEYS[1] 'lock',\n    KEYS[2] 'stalled'\n    ARGV[1]  token\n    ARGV[2]  lock duration in milliseconds\n    ARGV[3]  jobid\n  Output:\n    \"1\" if lock extented succesfully.\n]]\nlocal rcall = redis.call\nif rcall(\"GET\", KEYS[1]) == ARGV[1] then\n  --   if rcall(\"SET\", KEYS[1], ARGV[1], \"PX\", ARGV[2], \"XX\") then\n  if rcall(\"SET\", KEYS[1], ARGV[1], \"PX\", ARGV[2]) then\n    rcall(\"SREM\", KEYS[2], ARGV[3])\n    return 1\n  end\nend\nreturn 0\n",
                            "2",
                            "{action}:action:71840:lock",
                            "{action}:action:stalled",
                            "ba1a1f4b-06e0-4150-87dc-4c942832c51a:1",
                            "30000",
                            "71840"
                        ]
                    }
                }
            ]
        }

tobiasviehweger avatar Jan 17 '25 14:01 tobiasviehweger

@tobiasviehweger yes, I am not sure if this is the same because there are other places where we combine MULTI with EVAL, such as in batches, flows, schedulers, so depending on what you are using you can trigger this error.

@madolson I am pinging you in case you are not following this issue yet :)

manast avatar Jan 17 '25 18:01 manast

Ah.. I see we are using schedulers.. will try to change them to be executed from somewhere else.. I'll report back if we detect other issues as well..

tobiasviehweger avatar Jan 21 '25 21:01 tobiasviehweger

Hi @manast

we have now removed schedulers but still are getting this with rather normal queues from time to time.. is there anything related to stalled item processing that would trigger this as well? I'm not too deep into the retry logic, unfortunately...

{
                    "stack": "ReplyError: ERR command not supported inside transaction\n    at parseError (/var/app/current/node_modules/redis-parser/lib/parser.js:179:12)\n    at parseType (/var/app/current/node_modules/redis-parser/lib/parser.js:302:14)",
                    "message": "ERR command not supported inside transaction",
                    "command": {
                        "name": "eval",
                        "args": [
                            "--[[\n  Extend lock and removes the job from the stalled set.\n  Input:\n    KEYS[1] 'lock',\n    KEYS[2] 'stalled'\n    ARGV[1]  token\n    ARGV[2]  lock duration in milliseconds\n    ARGV[3]  jobid\n  Output:\n    \"1\" if lock extented succesfully.\n]]\nlocal rcall = redis.call\nif rcall(\"GET\", KEYS[1]) == ARGV[1] then\n  --   if rcall(\"SET\", KEYS[1], ARGV[1], \"PX\", ARGV[2], \"XX\") then\n  if rcall(\"SET\", KEYS[1], ARGV[1], \"PX\", ARGV[2]) then\n    rcall(\"SREM\", KEYS[2], ARGV[3])\n    return 1\n  end\nend\nreturn 0\n",
                            "2",
                            "{distribute-atlassian-webhooks}:distribute-atlassian-webhooks:comment_created_jira:64d16cea-7558-464a-8b65-b09169f1e1d3_1090151_1737540568674:lock",
                            "{distribute-atlassian-webhooks}:distribute-atlassian-webhooks:stalled",
                            "b1f4545b-6821-453b-8863-08b4005d1a38:38916",
                            "30000",
                            "comment_created_jira:64d16cea-7558-464a-8b65-b09169f1e1d3_1090151_1737540568674"
                        ]
                    }
                }

tobiasviehweger avatar Jan 22 '25 10:01 tobiasviehweger

@manast Yeah, we are looking into it on our end. We manually blocked a number of commands inside MULTI, some because they cause consistency issues (EVALSHA specifically is problematic for us) and some for other reasons, like head-of-line blocking. We haven't found any issue with EVAL yet, though, so blocking it might just have been an oversight on our part. We have someone looking into fixing it.

madolson avatar Jan 22 '25 17:01 madolson

@madolson just to clarify, the command that will be used by BullMQ in MULTI would indeed be EVALSHA, as it would be too slow to send the lua script in every call.

manast avatar Jan 22 '25 17:01 manast
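
For context on why EVALSHA matters here: it identifies a script by the SHA1 digest of its source, so the script body does not have to be sent on every call. Outside a transaction, a common pattern is to try EVALSHA first and fall back to EVAL when the server replies NOSCRIPT. The sketch below shows that general pattern under assumed names (ScriptClient, scriptSha1, runCached); it is not BullMQ's or any client library's implementation.

```typescript
import { createHash } from 'node:crypto';

// Minimal client shape assumed for this sketch (not a specific library's API).
interface ScriptClient {
  evalsha(sha1: string, numKeys: number, ...args: string[]): Promise<unknown>;
  eval(script: string, numKeys: number, ...args: string[]): Promise<unknown>;
}

// EVALSHA looks a script up by the SHA1 hex digest of its source text.
function scriptSha1(script: string): string {
  return createHash('sha1').update(script).digest('hex');
}

// Try the short EVALSHA form first. If the server has not cached the script
// (NOSCRIPT), send the full source once with EVAL and return that result.
async function runCached(
  client: ScriptClient,
  script: string,
  numKeys: number,
  ...args: string[]
): Promise<unknown> {
  try {
    return await client.evalsha(scriptSha1(script), numKeys, ...args);
  } catch (err) {
    if (err instanceof Error && err.message.startsWith('NOSCRIPT')) {
      return client.eval(script, numKeys, ...args);
    }
    throw err;
  }
}
```

The saving is the difference between sending a 40-character digest and sending the whole Lua source on every call.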

Ok, that should still be possible but will require some more effort on our side. We'll keep you posted though.

madolson avatar Jan 23 '25 05:01 madolson

Hi @manast, sorry to ping you again here. We are still seeing errors, and we are not using any special functions anymore. My guess is that this is coming from the extendLock method in the JobPro class, as it is still using multi. Do you see any chance of removing this from the JobPro class? Thanks and have a good weekend!

//Edit.. ah this is only in the batched case... odd..

//Edit2: Possibly the extendLock method in the WorkerPro class it is, I think.. this does also use multi for non-batched cases

@manast You think this would work? https://gist.github.com/tobiasviehweger/85a57a6a099a40f44368ef4d9ac1dcaa

tobiasviehweger avatar Jan 25 '25 13:01 tobiasviehweger

@tobiasviehweger I will look into this asap or @roggervalf if you are faster than me :)

manast avatar Jan 26 '25 10:01 manast

Should be fixed from version 7.26.5.

manast avatar Feb 02 '25 21:02 manast

Hi @manast :) We're getting this error (ReplyError: EXECABORT Transaction discarded because of previous errors) on bullmq version 5.41.3 when calling upsertJobScheduler. We're using Elasticache Serverless with Valkey 8.0.

Omers-Frontegg avatar Feb 20 '25 15:02 Omers-Frontegg

Unfortunately it is not really possible for us to solve this issue. Hopefully the Valkey team can solve it; otherwise there are many non-serverless alternatives (or even serverless ones, for example Upstash, which also works well with BullMQ).

manast avatar Feb 20 '25 16:02 manast

@Omers-Frontegg This is something we will need to fix on our (AWS) side. Can you confirm that it's the same error manast mentioned above, with the "ERR command not supported inside transaction" message?

madolson avatar Feb 20 '25 23:02 madolson

@madolson Yep, here is the caught error.

[Image: screenshot of the caught error]

Omers-Frontegg avatar Feb 21 '25 09:02 Omers-Frontegg

@madolson Do you have a rough estimation when we could get a fix?

@manast Maybe you could suggest an alternative for the meantime?

Omers-Frontegg avatar Feb 22 '25 08:02 Omers-Frontegg

Hi @Omers-Frontegg, we are also working on removing the multi usage when upserting a job scheduler. It will still take 1 or 2 weeks of work. An alternative would be to use v5.39.2.

roggervalf avatar Feb 22 '25 16:02 roggervalf

@Omers-Frontegg Sorry, I do not. Supporting EVAL is possible but EVALSHA is turning out to be much more difficult. It seems like maybe the problem will get solved by the change roggervalf is mentioning.

madolson avatar Feb 25 '25 05:02 madolson

@Omers-Frontegg @madolson Even if we succeed in removing multi/eval from upsertJobScheduler (which is not trivial to do either, by the way), we will not be able to remove all the multi/evals we have in the rest of the codebase. So in general I cannot recommend using Elasticache Serverless with BullMQ for the time being, as we cannot guarantee that it will be stable enough. Until we can run the whole test suite against serverless, you should not use BullMQ in production with this Redis alternative. You could, on the other hand, use the standard non-serverless Elasticache, or Upstash as a serverless solution that does support multi/eval.

manast avatar Feb 25 '25 09:02 manast

Good to know, we should support it.

madolson avatar Feb 25 '25 21:02 madolson
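
For anyone who wants to check whether a given endpoint is affected before relying on it, a rough startup probe might look like the following. It queues a trivial script inside MULTI/EXEC and treats an EXECABORT as "not supported". TxLike, ClientLike, and supportsEvalInMulti are hypothetical names, and the interfaces are a minimal assumed shape; a real client may need a thin adapter to fit them.

```typescript
// Minimal shapes assumed for this sketch; not a specific library's types.
interface TxLike {
  eval(script: string, numKeys: number, ...args: string[]): TxLike;
  exec(): Promise<unknown>;
}

interface ClientLike {
  multi(): TxLike;
}

// Hypothetical probe: resolves to false when the server rejects EVAL inside a
// transaction with EXECABORT, as in the traces in this thread.
async function supportsEvalInMulti(client: ClientLike): Promise<boolean> {
  try {
    await client.multi().eval('return 1', 0).exec();
    return true;
  } catch (err) {
    const text = err instanceof Error ? err.message : String(err);
    if (text.startsWith('EXECABORT')) return false;
    // Unrelated failure (network, auth, ...): let the caller decide.
    throw err;
  }
}
```

Note that this only probes EVAL. As mentioned above, BullMQ uses EVALSHA inside MULTI, so a fuller check would need to cover that command as well, and a passing probe does not mean the whole library works on that endpoint.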