
uniquejobs:digests sorted set seems to grow forever

Open JeremiahChurch opened this issue 1 year ago • 3 comments

Describe the bug
Our prod uniquejobs:digests sorted set in Redis grew to 3 GB in about 3 weeks (~5 million jobs/day, with fewer than 1,000 total jobs in the queues and dead set at the time of the screenshot).

[screenshot: Redis memory usage for the uniquejobs:digests key]

Our lock TTLs are at most 6 hours; the vast majority are 5 minutes.

Expected behavior
My understanding is that the digests should be cleaned up as conditions occur (mostly when our jobs exit successfully), or at worst when the reaper runs.

Current behavior
The uniquejobs:digests sorted set grows until we run out of Redis RAM.
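For reference, a rough debugging sketch (not from the original report) of how the set can be inspected from a Rails console; the key name assumes the gem's default digest set, and a custom lock_prefix may change it:

# Debugging sketch only: check the digest set size and peek at a few entries.
# Key name is an assumption; adjust it for a custom lock_prefix.
Sidekiq.redis do |conn|
  puts conn.zcard("uniquejobs:digests")          # number of digests currently tracked
  puts conn.zrange("uniquejobs:digests", 0, 9)   # a few of the oldest digest entries
end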

Worker class
47 different jobs have a lock on them. The only locks we use are until_and_while_executing, until_executing, and until_executed; about 95% of them use until_and_while_executing.
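For illustration, a minimal sketch of how such a worker might be declared. The class name, argument, and lock_ttl override are hypothetical; only the lock type mirrors what is described above.

# Hypothetical worker for illustration; not taken from the actual application.
class ExampleSyncJob
  include Sidekiq::Job

  # Most of the affected workers use this lock; lock_ttl falls back to the
  # global config.lock_ttl (5 minutes) when not set per worker.
  sidekiq_options lock: :until_and_while_executing,
                  lock_ttl: 5.minutes

  def perform(record_id)
    # ... do the work ...
  end
end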

# our entire sidekiq config
require 'sidekiq'
require 'sidekiq-unique-jobs'

Sidekiq.default_job_options = { 'backtrace' => true, 'retry' => 15 }
Sidekiq.strict_args!

SidekiqUniqueJobs.configure do |config|
  config.lock_info = true
  config.lock_prefix = 'prod_uniq' # new value
  config.lock_ttl = 5.minutes # default for anything - any longer jobs should have one specified
  config.enabled = !Rails.env.test?
  config.logger_enabled = false
  config.debug_lua = false
  config.max_history = 10_000
  config.reaper          = :ruby # :ruby, :lua or :none/nil
  config.reaper_count    = 50 # Stop reaping after this many keys
  config.reaper_interval = 305 # Reap orphans every 5 minutes
  config.reaper_timeout  = 30
end

Sidekiq.default_configuration.redis = { url: ENV['REDIS_URL'] || 'redis://localhost:6379/0', network_timeout: 3 } # relax redis timeouts a bit (default is 1)
Sidekiq.configure_server do |config|
  if config.queues == ['default'] 
    concurrency = (ENV['SIDEKIQ_CONCURRENCY'] || (Rails.env.development? ? 7 : 23)).to_i

    config.queues = %w[af,10 o,8 ws,5 r,5 s,4 t,4 sl,3 searchkick,2 sd,1 c,1]
    config.concurrency = concurrency

    config.capsule('limited') do |cap|
      cap.concurrency = Rails.env.production? ? (concurrency / 3) : 1
      cap.queues = %w[af,10 wms,4 t,4 searchkick,3 sd,1] 
    end

    config.capsule('single') do |cap|
      cap.concurrency = 1
      cap.queues = %w[counters,1] 
    end
  end

  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end

  config.server_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Server
  end

  config.logger.level = ENV.fetch('SIDEKIQ_LOG_LEVEL', Logger::INFO) if Rails.env.production?

  SidekiqUniqueJobs::Server.configure(config)
end

Sidekiq.configure_client do |config|
  config.client_middleware do |chain|
    chain.add SidekiqUniqueJobs::Middleware::Client
  end
end

Additional context
We're generally running the top of main; currently sidekiq-unique-jobs 8.0.6, Sidekiq 7.1.6, Rails 7.0.8.

This is the second or third time we've seen this issue crop up; I'm not sure whether it was introduced recently or has always been there and we just haven't noticed until now.

Failures, or jobs exiting because of an exception or other non-normal exit, account for less than 0.1% of all jobs run.

I've been through the reaper-related issues and found some similar reports, but seemingly nothing exact.

As always, huge love for the gem <3

JeremiahChurch avatar Nov 21 '23 19:11 JeremiahChurch

Looking at the details on #637, as it seems very similar.

JeremiahChurch avatar Dec 19 '23 21:12 JeremiahChurch

@JeremiahChurch I believe this has improved with https://github.com/mhenrixon/sidekiq-unique-jobs/pull/774/commits/cddcc08e75661ee1f9947de28f337666fa292b07, and those changes should be on the main branch.

I have also tweaked the reaper a bit.

mhenrixon avatar Feb 07 '24 10:02 mhenrixon

Hey everybody,

we are currently observing the same behavior (a growing digests set) in one of our environments. All environments use:

  • sidekiq-unique-jobs 7.1.33
  • rails 7.0.8
  • sidekiq 6.5.8

The interesting thing is that we run the same application in three environments but can only observe the behavior in one of them; it's working fine in the other two.

  • Not impacted: a low traffic testing environment, a mid traffic production environment
  • Impacted: a high traffic production environment

@JeremiahChurch have you been able to identify/fix the issue? I am wondering if the reaper is actually working but just can't keep up with the number of locks to be removed.
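As a rough back-of-envelope (illustrative arithmetic only, assuming the reaper removes at most reaper_count keys per run and using the settings from the original post):

# Upper bound on how many orphaned digests the reaper can remove per day with
# reaper_count = 50 and reaper_interval = 305 (values from the config above).
reaper_count       = 50                               # keys reaped per run
reaper_interval    = 305                              # seconds between runs
runs_per_day       = 86_400 / reaper_interval         # => 283
max_reaped_per_day = runs_per_day * reaper_count      # => 14_150
# ~14k digests/day is tiny next to ~5 million jobs/day, so any steady source of
# orphans would outpace the reaper.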

Does anybody have a hint for debugging this issue?

MarkusHarmsen avatar Aug 05 '24 09:08 MarkusHarmsen