sidekiq-unique-jobs
Please try v7.1.5
I added an UPGRADING.md document that I will use to guide people through upgrading.
A ton of bug fixes went into this release! Even UntilAndWhileExecuting
should be less problematic now.
Finally, a special shoutout to my sponsors. You make all of this work possible and worth it!
After the gem upgrade, I noticed this issue (in the rails console):
3.0.1 :001 > Worker.perform_async("223")
/Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1291:in `>': no implicit conversion of Integer into Hash (TypeError)
3.0.1 :002 > Worker.perform_async(223)
/Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1291:in `>': no implicit conversion of Integer into Hash (TypeError)
I've checked the source code of /Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1291:
def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
  synchronize do |client|
    command = [:brpoplpush, source, destination, timeout]
    require 'pry'; binding.pry # I added it for debugging
    timeout += client.timeout if timeout > 0 # this line is failing on timeout > 0
    client.call_with_timeout(command, timeout)
  end
end
Then I did:
3.0.1 :003 > Worker.perform_async(221)
From: /Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1292 Redis#brpoplpush:

    1288: def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
    1289:   synchronize do |client|
    1290:     command = [:brpoplpush, source, destination, timeout]
    1291:     require 'pry'; binding.pry
 => 1292:     timeout += client.timeout if timeout > 0
    1293:     client.call_with_timeout(command, timeout)
    1294:   end
    1295: end
[1] pry(#<Redis>)> deprecated_timeout
=> {:timeout=>0}
[2] pry(#<Redis>)> client.timeout
=> 5.0
[3] pry(#<Redis>)> timeout
=> {:timeout=>0}
[4] pry(#<Redis>)> timeout += client.timeout
NoMethodError: undefined method `+' for {:timeout=>0}:Hash
from (pry):4:in `block in brpoplpush'
[5] pry(#<Redis>)> timeout > 0
TypeError: no implicit conversion of Integer into Hash
from (pry):5:in `>'
So the check timeout > 0 fails because timeout is {:timeout=>0} rather than an Integer.
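For reference, here is a standalone sketch of how I think this can happen (not the gem's code; the call forms are just my guess): under Ruby 3's keyword-argument separation, a hash passed positionally lands in deprecated_timeout instead of the timeout keyword.

# Standalone illustration only: a cut-down copy of redis-rb's brpoplpush signature.
def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
  timeout > 0 # raises TypeError when timeout is {:timeout=>0}
end

brpoplpush("src", "dst", timeout: 0)     # fine: timeout == 0
brpoplpush("src", "dst", { timeout: 0 }) # TypeError: no implicit conversion of Integer into Hash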
Is it possible that sidekiq-unique-jobs is passing the timeout in the wrong format down to redis-rb?
Must be, I'll see if I can replicate this in a test.
@ArturT worker configuration? I just need the sidekiq_options, really.
@ArturT I tried to replicate the problem, both in the rails console and in tests.
Here is my worker:
class MyWorker
  include Sidekiq::Worker

  sidekiq_options queue: :my_queue_name,
                  # Locks when the client pushes the job to the queue.
                  # The queue will be unlocked when the server starts processing the job.
                  # The server then goes on to create a runtime lock for the job to prevent
                  # simultaneous jobs from being executed. As soon as the server starts
                  # processing a job, the client can push the same job to the queue.
                  # https://github.com/mhenrixon/sidekiq-unique-jobs#until-and-while-executing
                  lock: :until_and_while_executing,
                  # search the logs for the SidekiqUniqueJobs keyword to find duplicates
                  log_duplicate_payload: true

  def perform(id)
  end
end
I'll try to debug it more later when I find some time.
It was good that you had me look at this, because I discovered I had unintentionally created a breaking change; it is fixed in 7.1.1, released just now.
I can confirm v7.1.1 works fine now. Thanks :)
@mhenrixon I have a different error this time.
Some of the jobs are timing out due to problems connecting to Redis. A higher timeout is recommended for the RedisCloud addon on Heroku, so I use a 5-second network_timeout. But it seems like the default timeout of 1 second is being used now (see the error message):
ConnectionPool::TimeoutError: Waited 1 sec
+58 non-project frames
1: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:77 in block (2 levels) in pop
2: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:69 in loop
3: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:69 in block in pop
4: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:68 in synchronize
5: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:68 in pop
6: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:78 in checkout
7: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:60 in block in with
8: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59 in handle_interrupt
9: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59 in with
10: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq.rb:95 in redis
11: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/connection.rb:19 in redis
12: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/locksmith.rb:104 in execute
13: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/lock/until_executed.rb:30 in execute
14: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/middleware/server.rb:25 in call
15: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/middleware.rb:41 in block in call
16: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:123 in block in with_logging_context
17: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/logger.rb:11 in with
18: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:138 in call
19: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:138 in with_configured_loggers_context
20: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:122 in with_logging_context
21: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/middleware.rb:40 in call
22: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/middleware/chain.rb:140 in block in invoke
23: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-status-2.0.1/lib/sidekiq-status/server_middleware.rb:42 in call
24: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/middleware/chain.rb:140 in block in invoke
25: /app/vendor/bundle/ruby/3.0.0/gems/rollbar-3.2.0/lib/rollbar/plugins/sidekiq/plugin.rb:11 in call
26: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/middleware/chain.rb:140 in block in invoke
Is it possible that sidekiq-unique-jobs no longer respects the 5-second timeout?
This is my config:
# config/initializers/sidekiq.rb
sidekiq_redis_config = {
  url: Secret.sidekiq_redis_url,
  namespace: REDIS_SIDEKIQ_NAMESPACE,
  # tune network timeouts to be a little more lenient when you are seeing occasional
  # timeout errors for the Heroku Redis Cloud addon
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
  network_timeout: 5
}

Sidekiq.configure_server do |config|
  # A Sidekiq server process requires at least (concurrency + 5) connections
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#complete-control
  config.redis = sidekiq_redis_config
  ...
end
...
@ArturT I can't think of a single thing. I've checked both the connection pool gem and the sidekiq gem and this behavior seems like a mystery to me.
I suspect this could be a coincidental issue. Maybe RedisCloud had temporary problems yesterday or during the night that caused timeouts, and this revealed a misconfiguration in my project.
I found out that I have a separate Redis config for the Sidekiq server and the Sidekiq client. In the Sidekiq server config I had the option timeout: 5 instead of network_timeout: 5 (I suspect the wrong key name caused the default 1s timeout to be used instead of 5s). I've updated the config and also upgraded sidekiq-unique-jobs to 7.1.2. I hope this will help.
Thanks for reporting back @ArturT! I was always wondering why sidekiq uses network_timeout: to set timeout: on the redis connection in redis-rb. It seems like it would be better to just pass on timeout:, since that is obviously done anyway.
I was wrong about the misconfiguration in my config. In one place I use Redis.new, which requires the timeout key; the network_timeout key is specific to Sidekiq.
I found another place where I use ConnectionPool, so I added the missing timeout key there. But this seems to be the config for the sidekiq client (not the server), so I'm not sure it will help, because the exceptions happened only for sidekiq workers (not web requests).
Sidekiq.configure_client do |config|
  threads_count = ENV.fetch('RAILS_MAX_THREADS') { 1 }
  # Add 2 more connections to the pool just in case.
  # This should help with error ConnectionPool::TimeoutError when connection is stuck.
  # Connections are lazy loaded so they won't be opened unless necessary.
  # https://github.com/mhenrixon/sidekiq-unique-jobs/issues/535
  size = threads_count.to_i + 2
  config.redis = ConnectionPool.new(size: size, timeout: 5, &redis_conn)
  ...
end
Hi @ArturT, it seems like pool_timeout in sidekiq translates to timeout in connection_pool, and connect_timeout translates to timeout in Redis, so indeed you must be using both. If you increase connect_timeout to 5, then you would still have to configure pool_timeout (in the server block).
Super confusing; I can't think of a single time when I would like my pool timeout to be less than the connect timeout. Quite the contrary, maybe I'd like the pool timeout to wait an additional second.
sidekiq_redis_config = {
  url: Secret.sidekiq_redis_url,
  namespace: REDIS_SIDEKIQ_NAMESPACE,
  # tune network timeouts to be a little more lenient when you are seeing occasional
  # timeout errors for the Heroku Redis Cloud addon
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
  network_timeout: 5,
  pool_timeout: 5
}

Sidekiq.configure_server do |config|
  # A Sidekiq server process requires at least (concurrency + 5) connections
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#complete-control
  config.redis = sidekiq_redis_config
end

redis_conn = lambda do
  Redis.new(timeout: 5, url: Secret.sidekiq_redis_url, namespace: REDIS_SIDEKIQ_NAMESPACE)
end

Sidekiq.configure_client do |config|
  threads_count = ENV.fetch('RAILS_MAX_THREADS') { 1 }
  # Add 2 more connections to the pool just in case.
  # This should help with error ConnectionPool::TimeoutError when connection is stuck.
  # Connections are lazy loaded so they won't be opened unless necessary.
  # https://github.com/mhenrixon/sidekiq-unique-jobs/issues/535
  size = threads_count.to_i + 2
  config.redis = ConnectionPool.new(size: size, timeout: 5, &redis_conn)
end
Try something like this, @ArturT.
@mhenrixon That's a good catch with pool_timeout. I added it to my config. If I see any exceptions, I will report back. Thank you for the help.
We started seeing a lot of ConnectionPool::TimeoutError errors after upgrading to 7.1.2. Downgrading to 7.0.12 resolved the problem. Our timeout is at 1s as well, but it used to be fine and it's fine again after the downgrade.
Is there any chance that a connection isn't checked back into the pool?
No, that would not be the case @benedikt. I recommend you check what @ArturT has done above and the conclusions we came to while debugging his issues (which, by the way, were not related to this gem).
It could be that you need an additional connection due to the upgrade.
@mhenrixon I don't think increasing pool_timeout is necessarily a good solution. Yes, it might decrease the chance of getting ConnectionPool::TimeoutError errors, but wouldn't it also increase total execution time? Sidekiq would have to sit there and wait for a connection to become available.
Please correct me if I'm wrong, but the way I understand it, pool_timeout is the amount of time Sidekiq waits to get a connection from the pool. Ideally there should always be enough connections for all threads around, so getting a connection should be (almost) instant once the pool is maxed out.
Sidekiq tries to automatically configure the pool to be big enough. Seeing the ConnectionPool::TimeoutError pop up makes me think that it suddenly isn't big enough anymore?
It could be that you need an additional connection due to the upgrade.
Did something change in this release that uses an additional connection? #604 maybe? Is there a way to create a new connection outside of the pool for this?
Maybe there is something to what @benedikt said. It seems like I had the default 1-second pool_timeout forever, until the upgrade to the new sidekiq-unique-jobs version revealed the ConnectionPool::TimeoutError issue. Probably having a low pool_timeout=1 limit is just fine.
Is it possible that sidekiq does not create a large enough pool of connections, and sidekiq-unique-jobs consumes them, so the whole pool can be busy and the pool_timeout=1 limit is reached? Sidekiq might not take into account the connections needed by sidekiq-unique-jobs. Or maybe there is some leak of connections in sidekiq-unique-jobs?
It seems like #604 added a SidekiqUniqueJobs::TimerTask, which under the hood uses a new Thread to repeatedly execute reaper_registered?. This increases the total number of threads that use the pool.
I believe Sidekiq itself needs two additional connections on top of the number of worker threads it creates.
With the SidekiqUniqueJobs::ReaperResurrector and the SidekiqUniqueJobs::OrphanManager each creating an additional thread, the pool is too small. When all worker threads, Sidekiq itself, the ReaperResurrector, and the OrphanManager try to use a Redis connection, one of them will have to wait for a connection to become available.
My guess is that this was okay until now because Sidekiq doesn't really use one of the additional two connections, so it was fine until the SidekiqUniqueJobs::ReaperResurrector was added.
Point taken, I'll have a look at making sure we only use a connection when needed.
There is no leaking of connections from what I can tell. That said, it does require more Redis connections for the extra work. If you didn't already have a ton of trouble with the reaper, you can turn the resurrector off with SidekiqUniqueJobs.config.reaper_resurrector_enabled = false (a sketch of where to set it is below).
Let me know if that helps.
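For example, in an initializer (a minimal sketch; the file path is only a suggestion):

# config/initializers/sidekiq_unique_jobs.rb (suggested location)
# Turn off the resurrector thread so it does not compete for a pool connection:
SidekiqUniqueJobs.config.reaper_resurrector_enabled = false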
Upgraded to the latest version and disabled the reaper resurrector. No issues so far 👍
Thank you for the confirmation @benedikt! I will disable it by default in the next release and refactor the resurrector.
Just to let you know: despite a 5-second pool timeout, after a week I got the exception ConnectionPool::TimeoutError: Waited 5 sec last night.
We've got a busy sidekiq worker pool, and I'm also seeing the same ConnectionPool::TimeoutError issues as others under version 7.1.2 of this gem.
We've tried setting the sidekiq connection pool to a larger size than the default with some code like this, but we're still seeing the issue.
Sidekiq.configure_server do |config|
  # As of sidekiq-unique-jobs 7.1.1, the SidekiqUniqueJobs::ReaperResurrector and the
  # SidekiqUniqueJobs::OrphanManager each need a connection beyond the thread concurrency
  # count + 5 connections required for the standard operation of sidekiq.
  #
  # See: https://github.com/mhenrixon/sidekiq-unique-jobs/issues/617
  config.redis = sidekiq_redis_config.merge(size: Sidekiq.options[:concurrency] + 7)

  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

  SidekiqUniqueJobs::Server.configure(config)
end
However, this does not seem to have helped much. Our experience of the issue is that the rate of exceptions starts a few hours after a new code deploy / sidekiq process restart and then accelerates from there, until the sidekiq worker instance is unable to find connections to process its own jobs. It feels to us like there must be some issue that causes connections to be used but not returned to the pool with this version of the gem.
We're planning on disabling the reaper process as our next attempt at correcting this issue.
@woodhull https://github.com/mhenrixon/sidekiq-unique-jobs/issues/617#issuecomment-876391761
@mhenrixon First off, thanks for the awesome gem!
We were seeing this issue in production after upgrading and had to revert back to our much older version once we started noticing jobs being dropped.
I believe I was able to isolate the circumstances by introducing latency into the redis connection for until_and_while_executing jobs.
It appears that if the latency is somewhere around 18ms, then the job will be rejected. This was tested over localhost on OSX, with sidekiq-unique-jobs version 7.1.2 and SidekiqUniqueJobs.config.reaper_resurrector_enabled = false in the initializer.
gem 'sidekiq-ent', '~> 1.8.1'
gem 'sidekiq-pro', '~> 5.0.0'
gem 'sidekiq-unique-jobs', '~> 7.1.2'
The sidekiq portion of this test is:
class TestWorker
  include Sidekiq::Worker

  def perform(timestamp)
    Rails.logger.info("performing")
    File.write('/tmp/timestamp', timestamp)
    sleep(10)
  end
end

class UniqueUntilAndWhileExecutingTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_and_while_executing
end

class UniqueUntilExecutingTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_executing
end

class UniqueUntilExecutedTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_executed
end
And executed in a rails console
def test(worker)
  timestamp = Time.now.to_i.to_s
  p worker.perform_async(timestamp)
  puts "check #{timestamp}?"
  gets
  p File.read('/tmp/timestamp') == timestamp
end

test(TestWorker)
test(UniqueUntilAndWhileExecutingTestWorker)
test(UniqueUntilExecutingTestWorker)
test(UniqueUntilExecutedTestWorker)
The ruby TCP proxy script used to introduce latency is in the following gist: https://gist.github.com/millerjs/8ab3a4ce63c4502da4c34db017ae417d
ruby ~/bp/scratch/latency_proxy.rb localhost:6379 8379
with the following results
- test(TestWorker)
  - 300ms latency, fails 0/10
- test(UniqueUntilAndWhileExecutingTestWorker)

  | latency | failure rate (fails / tries) | note       |
  |---------|------------------------------|------------|
  | 0 ms    | 0/1                          |            |
  | 10 ms   | 0/2                          |            |
  | 15 ms   | 0/2                          |            |
  | 18 ms   | 3/4                          | threshold? |
  | 20 ms   | 3/3                          |            |
  | 25 ms   | 1/1                          |            |
  | 50 ms   | 1/1                          |            |
  | 150 ms  | 1/2                          |            |
  | 230 ms  | 1/1                          |            |
  | 300 ms  | 4/4                          |            |

- test(UniqueUntilExecutingTestWorker)
  - 300ms latency, fails 0/3
- test(UniqueUntilExecutedTestWorker)
  - 300ms latency, fails 0/3
I'm happy to provide more info, logs, redis keys, etc. if that will help debug this issue.
edit: If I crank CLOCK_DRIFT_FACTOR to 1.0, this seems to "resolve" the latency issue.
edit 2: Does this just mean I need to set a high lock_ttl? If so, why just until_and_while_executing?
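For reference, that lock_ttl workaround would look something like this sketch (the 300-second value is arbitrary, not a recommendation):

class UniqueUntilAndWhileExecutingTestWorker < TestWorker
  # lock_ttl is in seconds; 300 is an arbitrary example value
  sidekiq_options queue: :medium, lock: :until_and_while_executing, lock_ttl: 300
end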
edit 3: It looks like runtime_lock.execute { return yield } is failing to acquire the lock. I checked WhileExecuting, and it appears to have the same issue in this scenario.
@mhenrixon the README says lock_timeout: nil # lock indefinitely, but https://github.com/mhenrixon/sidekiq-unique-jobs/blob/620fe7503f1d0395cad337d2e45681504913c7ef/lib/sidekiq_unique_jobs/locksmith.rb#L273-L277 executes a non-blocking call if config.timeout is nil.
Are these two things contradictory?
edit 1: It looks like method(:primed_async) in https://github.com/mhenrixon/sidekiq-unique-jobs/blob/620fe7503f1d0395cad337d2e45681504913c7ef/lib/sidekiq_unique_jobs/locksmith.rb#L101-L107 isn't respecting any lock_timeout wait settings. If .lock_timeout is a positive integer, then the blocking pop will succeed, but the promise will have already returned a nil value (https://github.com/mhenrixon/sidekiq-unique-jobs/blob/620fe7503f1d0395cad337d2e45681504913c7ef/lib/sidekiq_unique_jobs/locksmith.rb#L239-L243), so the primed_jid will be nil and the execute block will not be yielded (i.e. the job will be silently dropped): https://github.com/mhenrixon/sidekiq-unique-jobs/blob/620fe7503f1d0395cad337d2e45681504913c7ef/lib/sidekiq_unique_jobs/locksmith.rb#L260-L261
This explains why upping the ttl or the CLOCK_DRIFT_FACTOR will resolve the latency-dropped-job issue while upping the lock_timeout does not.
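A standalone toy illustration of that race, using concurrent-ruby's Promises (this is not the gem's code; the sleep and timeout values are made up):

require "concurrent"

# The slow block stands in for the blocking BRPOPLPUSH; the caller only waits
# 0.1s for the future's value, gets nil, and treats the lock as not acquired,
# even though the blocking work eventually finishes.
future = Concurrent::Promises.future { sleep 0.3; "jid-123" }
p future.value(0.1) # => nil  (interpreted as "no lock" -> job silently dropped)
p future.value      # => "jid-123" (the pop did succeed, just too late)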