
With 500K jobs in the delayed_jobs table, it gets really slow

Open amitsaxena opened this issue 11 years ago • 31 comments

With around 500K jobs in the delayed_jobs table, it gets really slow, to the point of being unusable. My slow query log is filled with these:

SET timestamp=1379681371;
UPDATE delayed_jobs SET locked_at = '2013-09-20 12:49:20', locked_by = 'delayed_job host:node1365 pid:20668' WHERE ((run_at <= '2013-09-20 12:49:20' AND (locked_at IS NULL OR locked_at < '2013-09-20 08:49:20') OR locked_by = 'delayed_job host:node1365 pid:20668') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1;

# Time: 130920  8:49:36
# Query_time: 4.683968  Lock_time: 0.000069  Rows_sent: 0  Rows_examined: 1
SET timestamp=1379681376;
DELETE FROM delayed_jobs WHERE delayed_jobs.id = 5233997;

# Time: 130920  8:50:55
# Query_time: 6.367763  Lock_time: 0.000082  Rows_sent: 3  Rows_examined: 595935
SET timestamp=1379681455;
SELECT COUNT(*) AS count_all, priority AS priority FROM delayed_jobs WHERE (run_at < '2013-09-20 12:50:48' and failed_at is NULL) GROUP BY priority;

I have the following indexes (added some more to speed up some queries that were showing up in the slow query log):

mysql> show index from delayed_jobs;
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table        | Non_unique | Key_name              | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| delayed_jobs |          0 | PRIMARY               |            1 | id          | A         |      628375 |     NULL | NULL   |      | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            1 | priority    | A         |          16 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            2 | run_at      | A         |       24168 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_lock     |            1 | locked_at   | A         |           9 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_lock     |            2 | locked_by   | A         |           9 |      100 | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | locked_by             |            1 | locked_by   | A         |           9 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | failed_at             |            1 | failed_at   | A         |          97 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | run_at                |            1 | run_at      | A         |       13660 |     NULL | NULL   | YES  | BTREE      |         |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+

But still, it is extremely slow.

amitsaxena avatar Sep 20 '13 12:09 amitsaxena

In case someone is looking for a quick fix to clear the queue, here is how we did it: http://aawaara.com/post/61798959791/break-it-down

Those were time-sensitive jobs, and we needed to get them out of the queue.

amitsaxena avatar Sep 21 '13 23:09 amitsaxena

@amitsaxena have you considered using the priority column? @banderous DelayedJob has many advantages over other worker solutions for ruby. One of the things I like least about it is the interaction between running jobs and queueing jobs. Sounds like this would be an upgrade in this dimension. I'd vote for it.

mattbornski avatar Oct 13 '13 17:10 mattbornski

@mattbornski yes I use the priority column - it's just that there are so many database queries by the workers (for polling the queue, updating timestamps, host, etc.), that mysql becomes a bottleneck.

amitsaxena avatar Oct 13 '13 19:10 amitsaxena

You might also try delayed_job_redis (or one of the other backends).

inspire22 avatar Oct 17 '13 22:10 inspire22

@inspire22 we have moved to sidekiq now.

amitsaxena avatar Oct 21 '13 08:10 amitsaxena

Can we learn something from que?

xpepermint avatar Jan 30 '15 18:01 xpepermint

Que only works with postgresql. So nope.

albus522 avatar Jan 30 '15 18:01 albus522

I am suffering from this issue now:

  • Rails 4.2.6

  • DelayedJob 4.1.2

  • Workers: 4

    select count(*) from delayed_jobs; // -> 279105

    SQL (3674.6ms) UPDATE delayed_jobs SET delayed_jobs.locked_at = '2016-06-06 08:16:47', delayed_jobs.locked_by = 'delayed_job host:XXX pid:32370' WHERE ((run_at <= '2016-06-06 08:16:47.546325' AND (locked_at IS NULL OR locked_at < '2016-06-06 04:16:47.546367') OR locked_by = 'delayed_job host:XXX pid:32370') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1

I have tried adding more workers, but that just makes the problem worse.

fguillen avatar Jun 06 '16 08:06 fguillen

You can try:

Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql

Using that can affect performance both ways, but for some people it helps.
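For reference, a minimal sketch of where that setting typically lives, assuming a standard Rails initializer (the file path is just a convention, not something the gem requires):

# config/initializers/delayed_job.rb (hypothetical path; any initializer works)
# Switch the worker's job-reservation query from the gem's :optimized_sql
# strategy to the plain :default_sql strategy.
Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql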

Also is that 280k active jobs or is that mostly old failed jobs?

If all those are active what is your average job runtime? DJ is not the ideal candidate if you are running massive numbers of very fast jobs.

If they are old failed jobs, do you really still need them?

albus522 avatar Jun 06 '16 14:06 albus522

Also is that 280k active jobs or is that mostly old failed jobs?

All active

If all those are active what is your average job runtime?

Less than a second, but enough to want to put them out of the request.

DJ is not the ideal candidate if you are running massive numbers of very fast jobs.

Good to know

Thanks @albus522

fguillen avatar Jun 09 '16 20:06 fguillen

SO thread: http://stackoverflow.com/q/40360455/316700

fguillen avatar Nov 01 '16 12:11 fguillen

On MySQL at least, this appears to be due to the LIMIT 1 in several UPDATE queries. See this blog post for a bit more explanation.

I dropped my delayed_job table from ~220k to ~9k rows (lots of failures due to an application bug) and reduced MySQL CPU usage from 80% to 15%.
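The exact cleanup isn't shown here, but as a hedged illustration only, pruning old failed jobs in batches with plain ActiveRecord could look like the sketch below (the one-week cutoff and batch size are arbitrary examples, not what was actually used; in_batches needs Rails 5+):

# Illustrative sketch only -- delete failed jobs older than an arbitrary cutoff.
# Adjust the condition to whatever actually applies to your queue.
Delayed::Job.where.not(failed_at: nil)
            .where("failed_at < ?", 1.week.ago)
            .in_batches(of: 10_000)
            .delete_all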

DanielRussell avatar Mar 28 '17 20:03 DanielRussell

In Postgres, you can significantly speed it up just with 2 trivial steps:

  1. Use SELECT ... FOR UPDATE SKIP LOCKED instead of just SELECT ... FOR UPDATE (the code is here https://github.com/collectiveidea/delayed_job_active_record/blob/master/lib/delayed/backend/active_record.rb)

  2. Add this index:

CREATE INDEX i_delayed_jobs_priority_partial
  ON delayed_jobs USING btree (priority, run_at, queue) WHERE failed_at IS NULL;
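If you manage schema through Rails, the same partial index can be written as a migration. A minimal sketch, assuming Rails 5+ and a PostgreSQL database (the class name and migration version are illustrative; the where: option is what makes it a partial index):

class AddPartialIndexToDelayedJobs < ActiveRecord::Migration[5.0]
  def change
    # Composite index covering the worker's pick-up query; rows with a
    # non-NULL failed_at are excluded, keeping the index small.
    add_index :delayed_jobs, [:priority, :run_at, :queue],
              where: "failed_at IS NULL",
              name: "i_delayed_jobs_priority_partial"
  end
end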

NikolayS avatar Mar 28 '17 22:03 NikolayS

SKIP LOCKED was introduced in Postgres 9.5, so that feature requires a relatively new Postgres install and would require us to make our code even more complicated with db version detection so things don't break for some users.

Indices are probably something we need to revisit, but results have not been consistent for any suggested index update so far. They have generally worked for a user's specific case but not for all, and this index is very specific to Postgres.

albus522 avatar Mar 29 '17 13:03 albus522

@albus522 thank you for your response, but it was not a change request; it was advice for those who use DJ with Postgres (I hope with 9.6, or at least 9.5, otherwise one more piece of advice: upgrade!).

Without those two improvements, you will have big performance issues if you need to process millions of jobs per day.

I considered creating a pull request but changed my mind when I saw how many PRs have been left unprocessed for years in this repo.

NikolayS avatar Mar 29 '17 15:03 NikolayS

Indexing is definitely the way to go. Here's what ours looks like:

add_index "delayed_jobs", ["deleted_at", "locked_at", "failed_at", "queue", "run_at", "locked_by"], name: "deleted_at_locked_at_failed_at_queue_run_at_locked_by", using: :btree

With about 200,000 rows, our queries went from about 500ms to 0.7ms, a 714x improvement.

joshuapinter avatar Feb 15 '19 08:02 joshuapinter

@joshuapinter that's a huge composite index, and I doubt that most of the fields in it are really adding any speed to your queries.

It's been quite a while since I moved away from delayed_job, but try having an index on just the most selective column in the query (or maybe two if one is not good enough) and that should give you similar performance gains. Once you are down from, say, a million rows to 100 (using a selective column index), the extra columns in the index don't really add much value; they just make your writes costly.

Looking at the queries above (unless they have changed in newer versions), multiple single-column indexes probably make more sense here than one huge multi-column composite index.

amitsaxena avatar Feb 16 '19 02:02 amitsaxena

@amitsaxena You're right. I'm gonna run some quick tests there and see where I can stop and still achieve the same performance improvements.

joshuapinter avatar Feb 16 '19 02:02 joshuapinter

@amitsaxena Ha! You're right. I got the same performance improvement just by using a single deleted_at column in the index, which makes sense because that filters things down to single-digit record counts instead of 200,000+.

Thanks for the tip and reminder!

joshuapinter avatar Feb 16 '19 02:02 joshuapinter

Glad that I could be of help! :)

amitsaxena avatar Feb 16 '19 03:02 amitsaxena

@albus522 isn't it time to reconsider it?

SKIP LOCKED was introduced in postgres 9.5, so that feature requires a relatively new postgres install

In less than a year, 9.5 will become the oldest community-supported version of Postgres; see https://www.postgresql.org/support/versioning/. Most of the versions currently supported by PGDG already have the SKIP LOCKED feature, and MySQL has had it since version 8.

Indices are probably something we need to revisit, but results have not been consistent for any suggested index update so far.

Have you checked the index I proposed in https://github.com/collectiveidea/delayed_job/issues/581#issuecomment-289920478? I believe it is helpful, if not in all cases, then in the majority of them.

NikolayS avatar Mar 14 '19 19:03 NikolayS

For MySQL 5.6, adding an index on the failed_at column helped considerably.
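In schema terms that is just a single-column index; a sketch in the same style as the composite add_index shown earlier in this thread (the index name is illustrative):

add_index "delayed_jobs", ["failed_at"], name: "index_delayed_jobs_on_failed_at", using: :btree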

gcv avatar Oct 13 '19 17:10 gcv

Many select queries (120 reads) on a table with under 2k entries brought everything to a crawl. Adding an index on delayed_jobs(priority, run_at) improved the situation. I also had an index on just the priority column; I had to drop it because MySQL 5.7 kept preferring it.
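A sketch of that change as a Rails migration, assuming the old index was a plain single-column index on priority (class name and migration version are illustrative):

class AdjustDelayedJobsIndexes < ActiveRecord::Migration[5.1]
  def change
    # Composite index matching ORDER BY priority ASC, run_at ASC in the pick-up query.
    add_index :delayed_jobs, [:priority, :run_at]
    # MySQL 5.7 kept preferring the narrower priority-only index, so drop it.
    remove_index :delayed_jobs, :priority
  end
end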

Cervenka avatar Jul 07 '21 06:07 Cervenka

I added an index on delayed_jobs(priority,run_at) and switched to the default_sql strategy, which is pretty stable now with about 1M records.

Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql

jairovm avatar May 20 '22 19:05 jairovm

I added an index on delayed_jobs(priority,run_at) and switched to the default_sql strategy, which is pretty stable now with about 1M records.

Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql

@jairovm With PostgreSQL or MySQL?

toao avatar Sep 29 '22 12:09 toao

It was MySQL, but I ended up building a hybrid between DelayedJob and Sidekiq. I wrote a small rake task that grabs jobs from the delayed_jobs table in batches of 1,000 and enqueues their ids into Redis, where Sidekiq workers pick them up and process them using Delayed::Worker, something like this:

namespace :jobs do
  desc 'Enqueue Delayed::Job records'
  task enqueuer: :environment do
    Rails.logger.info "Starting job:enqueuer task"

    loop do
      if Sidekiq::ProcessSet.new.size.positive?
        Delayed::Job.order(:run_at).where(run_at: ..Time.current).limit(1_000).pluck(:id).each do |id|
          Rails.logger.info "Enqueueing DelayedJob::Job: #{id}"
          DelayedJob::EnqueuerJob.perform_later(job_id: id)
        end
      else
        Rails.logger.info "No workers available, skipping delayed_job jobs for now."
      end

      sleep 10
    end
  end
end

class DelayedJob::EnqueuerJob < ActiveJob::Base
  sidekiq_options retry: 5
  queue_as :critical

  def perform(job_id:, **)
    Delayed::Job.transaction do
      return unless (job = Delayed::Job.lock.find_by(id: job_id))

      Delayed::Worker.new.run(job)
    end
  end
end

With that, you can run rake jobs:enqueuer instead of the rake jobs:work task. We have about 1.5M records in our delayed_jobs table, and it's working pretty well 👌🏽 (still a beta version of this, tho).

I hope it helps, @toao 👍🏽

jairovm avatar Oct 11 '22 04:10 jairovm

@jairovm thanks for the feedback!

Meanwhile I developed a similar batch solution, but within delayed_job itself. It helped for a short interim period, but DJ has now been fully dropped and replaced with an AMQP-based approach. It works much more smoothly, with no polling, and it scales nicely.

toao avatar Oct 11 '22 05:10 toao

In Postgres, you can significantly speed it up just with 2 trivial steps:

  1. Use SELECT ... FOR UPDATE SKIP LOCKED instead of just SELECT ... FOR UPDATE (the code is here https://github.com/collectiveidea/delayed_job_active_record/blob/master/lib/delayed/backend/active_record.rb)
  2. Add this index:
CREATE INDEX i_delayed_jobs_priority_partial
  ON delayed_jobs USING btree (priority, run_at, queue) WHERE failed_at IS NULL;

I've added the indices to the delayed_jobs table and I'm already seeing significant improvement! Does anyone know how to make the modifications for the first step mentioned, using SELECT ... FOR UPDATE SKIP LOCKED? I assume it should be done in the delayed_job initializer file in Rails.

kad92 avatar Jan 31 '23 07:01 kad92

I've added the indices to the delayed_jobs table and I'm already seeing significant improvement! Does anyone know how to make the modifications for the first step mentioned, using SELECT ... FOR UPDATE SKIP LOCKED? I assume it should be done in the delayed_job initializer file in Rails.

Ran into a similar issue recently where large row counts in the delayed_jobs table caused high row lock contention on the db, and job processing ground to a halt. Applying the following monkey patch in an initializer file added the SKIP LOCKED PostgreSQL query functionality, and voila, our db lock contention vanished 💥 (note we do have the index defined as well).

# typed: false
# frozen_string_literal: true
# https://github.com/collectiveidea/delayed_job/issues/581#issuecomment-289920478
# https://www.postgresql.org/docs/16/sql-select.html ('The Locking Clause' section)
module Delayed
  module Backend
    module ActiveRecord
      class Job < ::ActiveRecord::Base
        def self.reserve_with_scope_using_optimized_postgres(ready_scope, worker, now)
          quoted_name = connection.quote_table_name(table_name)
          subquery    = ready_scope.limit(1).lock("FOR UPDATE SKIP LOCKED").select("id").to_sql
          sql         = "UPDATE #{quoted_name} SET locked_at = ?, locked_by = ? WHERE id IN (#{subquery}) RETURNING *"
          reserved    = find_by_sql([sql, now, worker.name])
          reserved[0]
        end
      end
    end
  end
end

Shout out to @NikolayS for providing the fix that worked for us.

Other Postgres-backed queue systems also implement SKIP LOCKED:

  • https://github.com/rails/solid_queue/blob/664931da38548e5a3845952fe24b24f5285950ee/app/models/solid_queue/record.rb#L11
  • https://github.com/Betterment/delayed/blob/24b75f56c43e470ecdec2d3afdcc11bb21a6e2e3/app/models/delayed/job.rb#L88

And here's a good read on why using this feature is a good idea: https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/

camallen avatar Jul 02 '24 04:07 camallen

@camallen / @NikolayS if we could get this SKIP LOCKED reservation strategy plugged into both the PostgreSQL and MySQL adapters as an alternate reservation strategy, that could make delayed_job a contender again in the Rails async job ecosystem!

diclophis avatar Aug 30 '24 16:08 diclophis