dragonfly icon indicating copy to clipboard operation
dragonfly copied to clipboard

Dragonfly does not work with nest/bull

Open romange opened this issue 2 years ago • 48 comments

See this https://github.com/BlackYuzia/Dragonfly-Bull to reproduce.

  1. Dragonfly rejects EVAL under MULTI (which could potentially be fixed by making the multi TX global - dynamically)
  2. We also do not allow undeclared keys and this library uses undeclared keys extensively. See, for example, https://github.com/OptimalBits/bull/blob/develop/lib/commands/addJob-6.lua

See where I think EVAL is called under multi (but I am not sure): https://github.com/OptimalBits/bull/blob/master/lib/job.js#L88

First reported in #781

romange avatar Feb 12 '23 19:02 romange

@romange on some occasions we use multi and make a bunch of EVAL/EVALSHA calls, for example when adding jobs in a batch. I think Bull and BullMQ are probably the libraries that do more crazy stuff with lua scripts, if all Bull/BullMQ tests pass with Dragonfly I think you can be quite confident that your LUA script support is 100% Redis compatible 😅

manast avatar Feb 16 '23 09:02 manast

Thanks @manast ! If it's not too hard, can you please write here (or point me to) how to run these tests? Also, where to override "Redis" backend in these tests (to inject dragonfly instead).

romange avatar Feb 16 '23 11:02 romange

It is quite straightforward, just clone the package, install dependencies with yarn, and then run yarn test. The tests will use whatever Redis instance is running locally (it will connect to localhost:6379). I recommend you start with BullMQ as it has more features (and more intricate lua scripts): https://github.com/taskforcesh/bullmq

manast avatar Feb 16 '23 13:02 manast

Thank you, Manuel!

On Thu, Feb 16, 2023 at 3:03 PM Manuel Astudillo @.***> wrote:

It is quite straightforward, just clone the package, install dependencies with yarn, and then run yarn test. The tests will use whatever Redis instance is running locally (it will connect to localhost:6379). I recommend you start with BullMQ as it has more features (and more intricate lua scripts): https://github.com/taskforcesh/bullmq

— Reply to this email directly, view it on GitHub https://github.com/dragonflydb/dragonfly/issues/782#issuecomment-1433056753, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4BFCFKADT5KL5RQBJUUSDWXYQRVANCNFSM6AAAAAAUZQH7BU . You are receiving this because you were mentioned.Message ID: @.***>

romange avatar Feb 16 '23 13:02 romange

discussion for reference, https://github.com/taskforcesh/bullmq/discussions/1689

DanielNetzer avatar Feb 17 '23 09:02 DanielNetzer

I tried to run a simple test with dragonfly (in intel MacOS) and just trying to add a job to a queue (the simplest operation in BullMQ) will give this error:

Error: write EPIPE
    at afterWriteDispatched (node:internal/stream_base_commons:160:15)
    at writeGeneric (node:internal/stream_base_commons:151:3)
    at Socket._writeGeneric (node:net:917:11)
    at Socket._write (node:net:929:8)
    at writeOrBuffer (node:internal/streams/writable:392:12)
    at _write (node:internal/streams/writable:333:10)
    at Writable.write (node:internal/streams/writable:337:10)
    at EventEmitter.sendCommand (/Users/manuelastudillo/Dev/taskforce/bullmq/node_modules/ioredis/built/Redis.js:394:29)
    at /Users/manuelastudillo/Dev/taskforce/bullmq/node_modules/ioredis/built/redis/event_handler.js:266:26
    at /Users/manuelastudillo/Dev/taskforce/bullmq/node_modules/ioredis/built/redis/event_handler.js:69:51 {
  errno: -32,
  code: 'EPIPE',
  syscall: 'write'
}

manast avatar May 21 '23 09:05 manast

(running v1.3.0 of dragonfly)

manast avatar May 21 '23 09:05 manast

I did some more tests and they fail because some Redis commands are not implemented yet: XREAD, XTRIM, etc.

manast avatar May 21 '23 09:05 manast

I just encountered this today when trying to use dragonfly instead of redis with immich :(

https://github.com/immich-app/immich/issues/2542

uhthomas avatar May 23 '23 18:05 uhthomas

I also really don't mean to be pedantic, but the compatibility claim is very misleading.

Fully compatible with Redis and Memcached APIs, Dragonfly requires no code changes to adopt.

https://www.dragonflydb.io/redis-alternative

Fully compatible with Redis Dragonfly is a drop-in Redis replacement, meaning it uses the same APIs and is compatible with all of the same SDKs and tooling. Teams that switch from Redis to Dragonfly get huge performance gains and simpler system to operate, all without changing code.

Don't get me wrong, Dragonfly is great and I'm keen to use it, but I do feel mislead.

uhthomas avatar May 23 '23 18:05 uhthomas

Hello @uhthomas . Sorry about that! we have implemented around 200 commands that are fully compatible. These docs expand on what commands are supported so far: https://www.dragonflydb.io/docs/category/command-reference

Lua scripts and specifically job management frameworks are still not cracked fully yet, but we are working on completing this functionality. You are right that we should do a better job at explaining the compatibility gap.

romange avatar May 24 '23 05:05 romange

@uhthomas

Hi. We just newly support this behavior, but don't have yet an explanation for it.

Dragonfly by default disallows accessing undeclared keys in Lua scripts. You can enable it with the following flag: dragonfly --default_lua_config=allow-undeclared-keys.

dranikpg avatar May 24 '23 06:05 dranikpg

@dranikpg is correct, but I think we should perform more test coverage before we advice bull users to try Dragonfly.

romange avatar May 24 '23 06:05 romange

Thanks for following up. I look forward to using dragonfly at some point in the future.

200 commands

A naive look at https://redis.io/commands/ suggests there are 470 commands in total? Is this correct?

document.querySelectorAll('article').length
470

uhthomas avatar May 24 '23 23:05 uhthomas

No, it's not correct. That includes "Redis Stack", i.e. including Redis Modules. For Redis OSS (v7):

image

romange avatar May 27 '23 13:05 romange

image

romange avatar May 27 '23 13:05 romange

@dranikpg @romange it's not perfect yet (there may be some performance optimizations that need to be made?), but I can confirm that DragonflyDB with the --default_lua_flags=allow-undeclared-keys flag is working in production used with bull and bull-board.

ThatOneCalculator avatar Jun 01 '23 05:06 ThatOneCalculator

Did you mean --default_lua_flags=allow-undeclared-keys maybe?

chakaz avatar Jun 01 '23 06:06 chakaz

Large report of probably non-issue, good for context (click to reveal) Interestingly, I keep getting stall errors as of https://github.com/dragonflydb/dragonfly/commit/cde6adb34d1dd016a03cdb875bf14ba2311d9244 , where *every* job is being pushed into waiting and nothing is being processed with this error:
0|Calckey_social  | WARN 6    [queue deliver]    failed(Error: job stalled more than allowable limit) id=16664 attempts=0/10 age=1632m to=https://xxx.xxx/inbox

(where xxx.xxx is each instance)

image image

ThatOneCalculator avatar Jun 02 '23 22:06 ThatOneCalculator

I cleared the queue. Reverting back to e3086f56a106ef4d7d2da4cc3748127a93def69e works fine, but updating again to cde6adb34d1dd016a03cdb875bf14ba2311d9244 produces a few of generic queue errors (i.e. [queue inbox] error Error processing job), but does seem to work okay overall.

ThatOneCalculator avatar Jun 02 '23 23:06 ThatOneCalculator

Pure guess: do you use a debug build? Perhaps https://github.com/dragonflydb/dragonfly/commit/753f7fcdbcb013838501f8f4a772a2f4cc4d14a4 fixes this issue?

chakaz avatar Jun 03 '23 04:06 chakaz

No, I use the build from source method to make a release build from the latest git commit, and I already have that commit in the build.

ThatOneCalculator avatar Jun 03 '23 04:06 ThatOneCalculator

Not sure if its related to those Error processing job issues - but running https://github.com/taskforcesh/bullmq tests with the above fixes theres still quite a few failures - will keep looking

Edit: Lots of tests seem to be just timing out rather than getting errors - though some like when there are more added jobs than max limiter are getting Error: write EPIPE and failing even with the timeout removed (running release build of ff1ffa0)

andydunstall avatar Jun 03 '23 04:06 andydunstall

I wonder if I should migrate from bull to bullmq for additional testing, I'm not sure how many people use bull compared to bullmq :thinking:

ThatOneCalculator avatar Jun 03 '23 04:06 ThatOneCalculator

I'm noticing some interesting graph patterns. Normally, Redis would have jobs constantly in delayed/processing and have a steady stream of active jobs, and not have repeating patterns like this.

image

ThatOneCalculator avatar Jun 03 '23 07:06 ThatOneCalculator

Over the last couple of days, here are the biggest things I've noticed when comparing how Redis operates with Bull vs Dragonfly with the exact same limits/configs:

  • With Dragonfly, jobs tend to get stuck in waiting and not get processes, leading to massive backlogs
  • With Dragonfly, jobs tend to go from process to active as if they're handing it off to each other, compared to Redis which handles them both simultaneously
  • With Dragonfly, jobs tend to get delayed a lot more for seemingly no reason
  • Dragonfly does seem to have the potential to handle far more Process and Active jobs simultaneously than Redis

ThatOneCalculator avatar Jun 04 '23 06:06 ThatOneCalculator

Thank you for your observations. However, Dragonfly still does not support Bull, so I would not jump to conclusions. I hope we will be able to learn in the coming weeks what's missing in Dragonfly to make it compatible with Bull. The effort will take some time :)

romange avatar Jun 04 '23 10:06 romange

Exactly, and I'm aware. Just happy to share my observations to help see what behaviors occur and if they end up being fixed with updates :)

ThatOneCalculator avatar Jun 04 '23 18:06 ThatOneCalculator

Running the BullMQ tests again with the above fixes, theres still 5/388 tests failing:

  • Delayed jobs: should process delayed jobs with several workers respecting delay
    • Not waiting the expected delay (waited at least delay time)
    • Note finding this fails with Redis too when running locally but on a different assertion
    • Edit: Actually the waited at least delay time can fail when using Redis too, though less often. I think its due to Lua scripts being run with a timestamp set on the client before the request is made, but concurrent scripts from different workers could run out of order so a script with timestamp T2 could run before another script with timestamp T1, which can occur in Redis but seems more likely with Dragonfly. In this case seeing a delayed job being moved to the wait queue at time T2 (where T2 > delay timestamp), but then the jobs 'processed on' is set to T1 (where T1 < delay timestamp). Theres only a few ms difference but thats enough for the assertion to fail, which checks 'processed on' is less than the delay timestamp
  • Job: removes 4000 jobs in time range of 4000ms
    • Times out after 4 seconds
    • Increasing the timeout this often passes (in ~6 seconds compared to Redis running in ~400ms), though it also sometimes gets a write EPIPE error
  • Obliterate: should obliterate a queue with high number of jobs in different statuses
    • Times out after 6 seconds
    • Increasing the timeout it gets ECONNRESET errors but retries and eventually passes in over 15 seconds
  • Pause: should not process delayed jobs
    • Sometimes passes, though often fails with an assertion error
    • Edit: I think this is similar to should process delayed jobs with several workers respecting delay where the order concurrent scripts run affects whether the test passes or fails. If the worker moveToActive runs before the job is added, the test passes, but if it runs after the job is added, the test fails (adding small await delays can cause Redis to fail and DF to pass)
  • Rate Limiter: processes jobs as max limiter from the beginning
    • Receives ECONNRESET errors and times out
    • Removing the timeout it still gets multiple ECONNRESET errors then hangs

Looking briefly at the ECONNRESET errors, they seem to occur with both epoll and io_uring. Adding logs to the server, it only closes the connection after recv returns 0. tcpdump shows there are no FIN packets before the RST, just data packets sent from the client to the server then the server responds with the RST. Not sure if I’ve configured something wrong, will keep looking

andydunstall avatar Jun 07 '23 17:06 andydunstall

@andydunstall I guess this is the logic that leads to ECONNRESET errors: https://github.com/romange/helio/blob/master/util/fibers/epoll_socket.cc#L347

basically we get 0 from recv, and return connection-aborted. The logic above the call probably just closes the socket without going through the shutdown sequence. And it may indeed be wrong, but do you know that the client code indeed shut down the connection before dragonfly received 0? I wonder if the problem is ECONNRESET or Dragonfly finds itself out of sync with bull client where Dragonfly reads from its socket when it should not...

romange avatar Jun 08 '23 07:06 romange