sidekiq-unique-jobs
Please try v7.1.5
I added an UPGRADING.md document that I will use to guide people through upgrading.
A ton of bug fixes went into this release! Even UntilAndWhileExecuting
should be less problematic now.
Finally, a special shoutout to my sponsors. You make all of this work possible and worth it!
After the gem upgrade, I noticed this issue (in the rails console):
3.0.1 :001 > Worker.perform_async("223")
/Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1291:in `>': no implicit conversion of Integer into Hash (TypeError)
3.0.1 :002 > Worker.perform_async(223)
/Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1291:in `>': no implicit conversion of Integer into Hash (TypeError)
I've checked the source code of /Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1291:
def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
  synchronize do |client|
    command = [:brpoplpush, source, destination, timeout]
    require 'pry'; binding.pry # I added it for debugging
    timeout += client.timeout if timeout > 0 # this line is failing on timeout > 0
    client.call_with_timeout(command, timeout)
  end
end
Then I did:
3.0.1 :003 > Worker.perform_async(221)
From: /Users/artur/.rvm/gems/ruby-3.0.1/gems/redis-4.3.1/lib/redis.rb:1292 Redis#brpoplpush:

    1288: def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
    1289:   synchronize do |client|
    1290:     command = [:brpoplpush, source, destination, timeout]
    1291:     require 'pry'; binding.pry
 => 1292:     timeout += client.timeout if timeout > 0
    1293:     client.call_with_timeout(command, timeout)
    1294:   end
    1295: end
[1] pry(#<Redis>)> deprecated_timeout
=> {:timeout=>0}
[2] pry(#<Redis>)> client.timeout
=> 5.0
[3] pry(#<Redis>)> timeout
=> {:timeout=>0}
[4] pry(#<Redis>)> timeout += client.timeout
NoMethodError: undefined method `+' for {:timeout=>0}:Hash
from (pry):4:in `block in brpoplpush'
[5] pry(#<Redis>)> timeout > 0
TypeError: no implicit conversion of Integer into Hash
from (pry):5:in `>'
So the check timeout > 0 fails because timeout is {:timeout=>0} rather than an Integer.
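For reference, here is a standalone sketch of how I think this can happen (not the gem's code; the call forms are just my guess): under Ruby 3's keyword-argument separation, a hash passed positionally lands in deprecated_timeout instead of the timeout keyword.

# Standalone illustration only: a cut-down copy of redis-rb's brpoplpush signature.
def brpoplpush(source, destination, deprecated_timeout = 0, timeout: deprecated_timeout)
  timeout > 0 # raises TypeError when timeout is {:timeout=>0}
end

brpoplpush("src", "dst", timeout: 0)     # fine: timeout == 0
brpoplpush("src", "dst", { timeout: 0 }) # TypeError: no implicit conversion of Integer into Hash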
Is it possible that sidekiq-unique-jobs is passing the timeout in the wrong format down to redis-rb?
Must be, I'll see if I can replicate this in a test.
@ArturT worker configuration? I just need the sidekiq_options, really.
@ArturT I tried to replicate the problem, both in the rails console and in tests.
Here is my worker:
class MyWorker
  include Sidekiq::Worker

  sidekiq_options queue: :my_queue_name,
                  # Locks when the client pushes the job to the queue.
                  # The queue will be unlocked when the server starts processing the job.
                  # The server then goes on to create a runtime lock for the job to prevent
                  # simultaneous jobs from being executed. As soon as the server starts
                  # processing a job, the client can push the same job to the queue.
                  # https://github.com/mhenrixon/sidekiq-unique-jobs#until-and-while-executing
                  lock: :until_and_while_executing,
                  # search the logs for the SidekiqUniqueJobs keyword to find duplicates
                  log_duplicate_payload: true

  def perform(id)
  end
end
I'll try to debug it more later when I find some time.
It was good that you had me look at this, because I discovered I had unintentionally created a breaking change; it is fixed in 7.1.1, released just now.
I can confirm v7.1.1 works fine now. Thanks :)
@mhenrixon I have a different error this time.
Some of the jobs are timing out due to problems connecting to Redis. A higher timeout is recommended for the RedisCloud addon on Heroku, so I use a 5-second network_timeout. But it seems like the default timeout of 1 second is being used now (see the error message):
ConnectionPool::TimeoutError: Waited 1 sec
+58 non-project frames
1: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:77 in block (2 levels) in pop
2: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:69 in loop
3: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:69 in block in pop
4: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:68 in synchronize
5: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool/timed_stack.rb:68 in pop
6: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:78 in checkout
7: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:60 in block in with
8: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59 in handle_interrupt
9: /app/vendor/bundle/ruby/3.0.0/gems/connection_pool-2.2.5/lib/connection_pool.rb:59 in with
10: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq.rb:95 in redis
11: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/connection.rb:19 in redis
12: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/locksmith.rb:104 in execute
13: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/lock/until_executed.rb:30 in execute
14: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/middleware/server.rb:25 in call
15: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/middleware.rb:41 in block in call
16: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:123 in block in with_logging_context
17: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/logger.rb:11 in with
18: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:138 in call
19: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:138 in with_configured_loggers_context
20: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/logging.rb:122 in with_logging_context
21: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-unique-jobs-7.1.1/lib/sidekiq_unique_jobs/middleware.rb:40 in call
22: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/middleware/chain.rb:140 in block in invoke
23: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-status-2.0.1/lib/sidekiq-status/server_middleware.rb:42 in call
24: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/middleware/chain.rb:140 in block in invoke
25: /app/vendor/bundle/ruby/3.0.0/gems/rollbar-3.2.0/lib/rollbar/plugins/sidekiq/plugin.rb:11 in call
26: /app/vendor/bundle/ruby/3.0.0/gems/sidekiq-6.2.1/lib/sidekiq/middleware/chain.rb:140 in block in invoke
Is it possible that sidekiq-unique-jobs no longer respects the 5-second timeout?
This is my config:
# config/initializers/sidekiq.rb
sidekiq_redis_config = {
  url: Secret.sidekiq_redis_url,
  namespace: REDIS_SIDEKIQ_NAMESPACE,
  # tune network timeouts to be a little more lenient when you are seeing occasional
  # timeout errors for the Heroku Redis Cloud addon
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
  network_timeout: 5
}

Sidekiq.configure_server do |config|
  # A Sidekiq server process requires at least (concurrency + 5) connections
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#complete-control
  config.redis = sidekiq_redis_config
  ...
end
...
@ArturT I can't think of a single thing. I've checked both the connection pool gem and the sidekiq gem and this behavior seems like a mystery to me.
I suspect this could be a coincidental issue. Maybe RedisCloud had temporary problems yesterday or during the night that caused timeouts, and this revealed a misconfiguration in my project.
I found out that I have a separate Redis config for the Sidekiq server and the Sidekiq client. In the Sidekiq server config I had the option timeout: 5 instead of network_timeout: 5 (I suspect the wrong key name caused the default 1s timeout to be used instead of 5s). I've updated the config and also upgraded sidekiq-unique-jobs to 7.1.2. I hope this will help.
Thanks for reporting back @ArturT! I was always wondering why sidekiq uses network_timeout: to set timeout: on the redis connection in redis-rb. It seems like it would be better to just pass on timeout:, since that is obviously done anyway.
I was wrong about the misconfiguration in my config. In one place I use Redis.new, which requires the timeout key; the network_timeout key is specific to Sidekiq.
I found another place where I use ConnectionPool, so I added the missing timeout key there. But this seems to be the config for the sidekiq client (not the server), so I'm not sure it will help, because the exceptions happened only for sidekiq workers (not web requests).
Sidekiq.configure_client do |config|
  threads_count = ENV.fetch('RAILS_MAX_THREADS') { 1 }
  # Add 2 more connections to the pool just in case.
  # This should help with error ConnectionPool::TimeoutError when connection is stuck.
  # Connections are lazy loaded so they won't be opened unless necessary.
  # https://github.com/mhenrixon/sidekiq-unique-jobs/issues/535
  size = threads_count.to_i + 2
  config.redis = ConnectionPool.new(size: size, timeout: 5, &redis_conn)
  ...
end
Hi @ArturT, it seems like pool_timeout in sidekiq translates to timeout in connection_pool, and connect_timeout translates to timeout in Redis, so indeed you must be using both. If you increase connect_timeout to 5, then you would still have to configure pool_timeout (in the server block).
Super confusing; I can't think of a single time when I would like my pool timeout to be less than the connect timeout. Quite the contrary, maybe I'd like the pool timeout to wait an additional second.
sidekiq_redis_config = {
  url: Secret.sidekiq_redis_url,
  namespace: REDIS_SIDEKIQ_NAMESPACE,
  # tune network timeouts to be a little more lenient when you are seeing occasional
  # timeout errors for the Heroku Redis Cloud addon
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
  network_timeout: 5,
  pool_timeout: 5
}

Sidekiq.configure_server do |config|
  # A Sidekiq server process requires at least (concurrency + 5) connections
  # https://github.com/mperham/sidekiq/wiki/Using-Redis#complete-control
  config.redis = sidekiq_redis_config
end

redis_conn = lambda do
  Redis.new(timeout: 5, url: Secret.sidekiq_redis_url, namespace: REDIS_SIDEKIQ_NAMESPACE)
end

Sidekiq.configure_client do |config|
  threads_count = ENV.fetch('RAILS_MAX_THREADS') { 1 }
  # Add 2 more connections to the pool just in case.
  # This should help with error ConnectionPool::TimeoutError when connection is stuck.
  # Connections are lazy loaded so they won't be opened unless necessary.
  # https://github.com/mhenrixon/sidekiq-unique-jobs/issues/535
  size = threads_count.to_i + 2
  config.redis = ConnectionPool.new(size: size, timeout: 5, &redis_conn)
end
Try something like this, @ArturT.
@mhenrixon That's a good catch with pool_timeout. I added it to my config. If I see any exceptions, I will report back. Thank you for the help.
We started seeing a lot of ConnectionPool::TimeoutError errors after upgrading to 7.1.2. Downgrading to 7.0.12 resolved the problem. Our timeout is at 1s as well, but it used to be fine and it's fine again after the downgrade.
Is there any chance that a connection isn't checked back into the pool?
No, that would not be the case @benedikt. I recommend you check what @ArturT has done above and the conclusions we came to while debugging his issues (which, by the way, were not related to this gem).
It could be that you need an additional connection due to the upgrade.
@mhenrixon I don't think increasing pool_timeout is necessarily a good solution. Yes, it might decrease the chance of getting ConnectionPool::TimeoutError errors, but wouldn't it also increase total execution time? Sidekiq would have to sit there and wait for a connection to become available.
Please correct me if I'm wrong, but the way I understand it, pool_timeout is the amount of time Sidekiq waits to get a connection from the pool. Ideally there should always be enough connections for all threads around, so getting a connection should be (almost) instant once the pool is maxed out.
Sidekiq tries to automatically configure the pool to be big enough. Seeing the ConnectionPool::TimeoutError pop up makes me think that it suddenly isn't big enough anymore?
It could be that you need an additional connection due to the upgrade.
Did something change in this release that uses an additional connection? #604 maybe? Is there a way to create a new connection outside of the pool for this?
Maybe there is something to what @benedikt said. It seems like I had the default 1-second pool_timeout forever, until the upgrade to the new sidekiq-unique-jobs version revealed the ConnectionPool::TimeoutError issue. Probably having a low pool_timeout=1 limit is just fine.
Is it possible that sidekiq does not create a large enough pool of connections, and sidekiq-unique-jobs consumes them, so the whole pool can be busy and the pool_timeout=1 limit is reached? Sidekiq might not take into account the connections needed by sidekiq-unique-jobs. Or maybe there is some leak of connections in sidekiq-unique-jobs?
It seems like #604 added a SidekiqUniqueJobs::TimerTask, which under the hood uses a new Thread to repeatedly execute reaper_registered?. This increases the total number of threads that use the pool.
I believe Sidekiq itself needs two additional connections on top of the number of worker threads it creates.
With the SidekiqUniqueJobs::ReaperResurrector and the SidekiqUniqueJobs::OrphanManager each creating an additional thread, the pool is too small. When all worker threads, Sidekiq itself, the ReaperResurrector, and the OrphanManager try to use a Redis connection, one of them will have to wait for a connection to become available.
My guess is that this was okay until now because Sidekiq doesn't really use one of the additional two connections, so it was fine until the SidekiqUniqueJobs::ReaperResurrector was added.
Point taken, I'll have a look at making sure we only use a connection when needed.
There is no leaking of connections from what I can tell. That said, it does require more Redis connections for the extra work. If you didn't already have a ton of trouble with the reaper, you can turn the resurrector off with SidekiqUniqueJobs.config.reaper_resurrector_enabled = false (a sketch of where to set it is below).
Let me know if that helps.
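For example, in an initializer (a minimal sketch; the file path is only a suggestion):

# config/initializers/sidekiq_unique_jobs.rb (suggested location)
# Turn off the resurrector thread so it does not compete for a pool connection:
SidekiqUniqueJobs.config.reaper_resurrector_enabled = false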
Upgraded to the latest version and disabled the reaper resurrector. No issues so far 👍
Thank you for the confirmation @benedikt! I will disable it by default in the next release and refactor the resurrector.
Just to let you know: despite a 5-second pool timeout, after a week I got the exception ConnectionPool::TimeoutError: Waited 5 sec last night.
We've got a busy sidekiq worker pool, and I'm also seeing the same ConnectionPool::TimeoutError issues as others under version 7.1.2 of this gem.
We've tried setting the sidekiq connection pool to a larger size than the default with some code like this, but we're still seeing the issue.
Sidekiq.configure_server do |config|
  # As of sidekiq-unique-jobs 7.1.1, the SidekiqUniqueJobs::ReaperResurrector and the
  # SidekiqUniqueJobs::OrphanManager each need a connection beyond the thread concurrency
  # count + 5 connections required for the standard operation of sidekiq.
  #
  # See: https://github.com/mhenrixon/sidekiq-unique-jobs/issues/617
  config.redis = sidekiq_redis_config.merge(size: Sidekiq.options[:concurrency] + 7)

  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

  SidekiqUniqueJobs::Server.configure(config)
end
However, this does not seem to have helped much. Our experience of the issue is that the rate of exceptions starts a few hours after a new code deploy / sidekiq process restart and then accelerates from there, until the sidekiq worker instance is unable to find connections to process its own jobs. It feels to us like there must be some issue that causes connections to be used but not returned to the pool with this version of the gem.
We're planning on disabling the reaper process as our next attempt at correcting this issue.
@woodhull https://github.com/mhenrixon/sidekiq-unique-jobs/issues/617#issuecomment-876391761
@mhenrixon First off, thanks for the awesome gem!
We were seeing this issue in production after upgrading and had to revert back to our much older version once we started noticing jobs being dropped.
I believe I was able to isolate the circumstances by introducing latency into the redis connection for until_and_while_executing jobs.
It appears that if the latency is somewhere around 18ms, then the job will be rejected. This was tested over localhost on OSX, with sidekiq-unique-jobs version 7.1.2 and SidekiqUniqueJobs.config.reaper_resurrector_enabled = false in the initializer.
gem 'sidekiq-ent', '~> 1.8.1'
gem 'sidekiq-pro', '~> 5.0.0'
gem 'sidekiq-unique-jobs', '~> 7.1.2'
The sidekiq portion of this test is:
class TestWorker
  include Sidekiq::Worker

  def perform(timestamp)
    Rails.logger.info("performing")
    File.write('/tmp/timestamp', timestamp)
    sleep(10)
  end
end

class UniqueUntilAndWhileExecutingTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_and_while_executing
end

class UniqueUntilExecutingTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_executing
end

class UniqueUntilExecutedTestWorker < TestWorker
  sidekiq_options queue: :medium, lock: :until_executed
end
And executed in a rails console
def test(worker)
  timestamp = Time.now.to_i.to_s
  p worker.perform_async(timestamp)
  puts "check #{timestamp}?"
  gets
  p File.read('/tmp/timestamp') == timestamp
end

test(TestWorker)
test(UniqueUntilAndWhileExecutingTestWorker)
test(UniqueUntilExecutingTestWorker)
test(UniqueUntilExecutedTestWorker)
The ruby TCP proxy script used to introduce latency is in the following gist: https://gist.github.com/millerjs/8ab3a4ce63c4502da4c34db017ae417d
ruby ~/bp/scratch/latency_proxy.rb localhost:6379 8379
with the following results
- test(TestWorker)
  - 300ms latency, fails 0/10
- test(UniqueUntilAndWhileExecutingTestWorker)

  | latency | failure rate (fails / tries) | note       |
  |---------|------------------------------|------------|
  | 0 ms    | 0/1                          |            |
  | 10 ms   | 0/2                          |            |
  | 15 ms   | 0/2                          |            |
  | 18 ms   | 3/4                          | threshold? |
  | 20 ms   | 3/3                          |            |
  | 25 ms   | 1/1                          |            |
  | 50 ms   | 1/1                          |            |
  | 150 ms  | 1/2                          |            |
  | 230 ms  | 1/1                          |            |
  | 300 ms  | 4/4                          |            |

- test(UniqueUntilExecutingTestWorker)
  - 300ms latency, fails 0/3
- test(UniqueUntilExecutedTestWorker)
  - 300ms latency, fails 0/3
I'm happy to provide more info, logs, redis keys, etc. if that will help debug this issue.
edit: If I crank CLOCK_DRIFT_FACTOR to 1.0, this seems to "resolve" the latency issue.
edit 2: Does this just mean I need to set a high lock_ttl? If so, why just until_and_while_executing?
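For reference, that lock_ttl workaround would look something like this sketch (the 300-second value is arbitrary, not a recommendation):

class UniqueUntilAndWhileExecutingTestWorker < TestWorker
  # lock_ttl is in seconds; 300 is an arbitrary example value
  sidekiq_options queue: :medium, lock: :until_and_while_executing, lock_ttl: 300
end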
edit 3: It looks like runtime_lock.execute { return yield } is failing to acquire the lock. I checked WhileExecuting, and it appears to have the same issue in this scenario.
@mhenrixon the README says lock_timeout: nil # lock indefinitely, but https://github.com/mhenrixon/sidekiq-unique-jobs/blob/620fe7503f1d0395cad337d2e45681504913c7ef/lib/sidekiq_unique_jobs/locksmith.rb#L273-L277 executes a non-blocking call if config.timeout is nil.
Are these two things contradictory?
edit 1: It looks like method(:primed_async) in https://github.com/mhenrixon/sidekiq-unique-jobs/blob/620fe7503f1d0395cad337d2e45681504913c7ef/lib/sidekiq_unique_jobs/locksmith.rb#L101-L107 isn't respecting any lock_timeout wait settings. If .lock_timeout is a positive integer, then the blocking pop will succeed, but the promise will have already returned a nil value (https://github.com/mhenrixon/sidekiq-unique-jobs/blob/620fe7503f1d0395cad337d2e45681504913c7ef/lib/sidekiq_unique_jobs/locksmith.rb#L239-L243), so the primed_jid will be nil and the execute block will not be yielded (i.e. the job will be silently dropped): https://github.com/mhenrixon/sidekiq-unique-jobs/blob/620fe7503f1d0395cad337d2e45681504913c7ef/lib/sidekiq_unique_jobs/locksmith.rb#L260-L261
This explains why upping the ttl or the CLOCK_DRIFT_FACTOR will resolve the latency-dropped-job issue while upping the lock_timeout does not.
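A standalone toy illustration of that race, using concurrent-ruby's Promises (this is not the gem's code; the sleep and timeout values are made up):

require "concurrent"

# The slow block stands in for the blocking BRPOPLPUSH; the caller only waits
# 0.1s for the future's value, gets nil, and treats the lock as not acquired,
# even though the blocking work eventually finishes.
future = Concurrent::Promises.future { sleep 0.3; "jid-123" }
p future.value(0.1) # => nil  (interpreted as "no lock" -> job silently dropped)
p future.value      # => "jid-123" (the pop did succeed, just too late)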