delayed_job
With 500K jobs in the delayed_jobs table, it gets really slow
With around 500K jobs in the delayed_jobs table, it gets really slow, to the point of being unusable. My slow query log is filled with these:
SET timestamp=1379681371;
UPDATE delayed_jobs SET locked_at = '2013-09-20 12:49:20', locked_by = 'delayed_job host:node1365 pid:20668' WHERE ((run_at <= '2013-09-20 12:49:20' AND (locked_at IS NULL OR locked_at < '2013-09-20 08:49:20') OR locked_by = 'delayed_job host:node1365 pid:20668') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1;
// Time: 130920 8:49:36
// Query_time: 4.683968 Lock_time: 0.000069 Rows_sent: 0 Rows_examined: 1
SET timestamp=1379681376;
DELETE FROM delayed_jobs WHERE delayed_jobs.id = 5233997;
// Time: 130920 8:50:55
// Query_time: 6.367763 Lock_time: 0.000082 Rows_sent: 3 Rows_examined: 595935
SET timestamp=1379681455;
SELECT COUNT(*) AS count_all, priority AS priority FROM delayed_jobs
WHERE (run_at < '2013-09-20 12:50:48' and failed_at is NULL) GROUP BY priority;
I have the following indexes (added some more to speed up some queries that were showing up in the slow query log):
mysql> show index from delayed_jobs;
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table        | Non_unique | Key_name              | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| delayed_jobs |          0 | PRIMARY               |            1 | id          | A         |      628375 |     NULL | NULL   |      | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            1 | priority    | A         |          16 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_priority |            2 | run_at      | A         |       24168 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_lock     |            1 | locked_at   | A         |           9 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | delayed_jobs_lock     |            2 | locked_by   | A         |           9 |      100 | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | locked_by             |            1 | locked_by   | A         |           9 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | failed_at             |            1 | failed_at   | A         |          97 |     NULL | NULL   | YES  | BTREE      |         |
| delayed_jobs |          1 | run_at                |            1 | run_at      | A         |       13660 |     NULL | NULL   | YES  | BTREE      |         |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
But still, it is extremely slow.
In case someone is looking for a quick fix to clear the queue, here is how we did it: http://aawaara.com/post/61798959791/break-it-down
Those were time-sensitive jobs, and we needed to get them out of the queue.
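For anyone who just wants the gist without clicking through, here is a minimal sketch of the same idea, assuming the stale jobs can simply be discarded. The one-week cutoff and batch size are arbitrary, and in_batches needs Rails 5+ (older apps can loop with find_in_batches instead).
# Hedged sketch, not the exact script from the linked post: clear out old jobs in
# small batches so a single huge DELETE never locks the whole delayed_jobs table.
Delayed::Job.where("failed_at IS NOT NULL OR run_at < ?", 1.week.ago)
            .in_batches(of: 10_000)
            .delete_all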
@amitsaxena have you considered using the priority column? @banderous DelayedJob has many advantages over other worker solutions for ruby. One of the things I like least about it is the interaction between running jobs and queueing jobs. Sounds like this would be an upgrade in this dimension. I'd vote for it.
@mattbornski yes, I use the priority column - it's just that the workers issue so many database queries (polling the queue, updating timestamps, host, etc.) that MySQL becomes a bottleneck.
You might also try the delayed_job_redis (or one of the other backends).
@inspire22 we have moved to sidekiq now.
Can we learn something from que?
Que only works with PostgreSQL, so nope.
I am suffering from this issue now:
- Rails 4.2.6
- DelayedJob 4.1.2
- Workers: 4
select count(*) from delayed_jobs; // -> 279105
SQL (3674.6ms)  UPDATE `delayed_jobs` SET `delayed_jobs`.`locked_at` = '2016-06-06 08:16:47', `delayed_jobs`.`locked_by` = 'delayed_job host:XXX pid:32370' WHERE ((run_at <= '2016-06-06 08:16:47.546325' AND (locked_at IS NULL OR locked_at < '2016-06-06 04:16:47.546367') OR locked_by = 'delayed_job host:XXX pid:32370') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1
I have tried adding more workers, but that just makes the problem worse.
You can try Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql
Using that can affect performance both ways, but for some people it helps.
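For reference, that setting normally lives in a Rails initializer; a minimal sketch (the file path below is just a convention, not something the gem requires):
# config/initializers/delayed_job.rb
# Fall back from the optimized single-UPDATE reservation query to the default strategy.
Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql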
Also is that 280k active jobs or is that mostly old failed jobs?
If all those are active what is your average job runtime? DJ is not the ideal candidate if you are running massive numbers of very fast jobs.
If they are old failed jobs, do you really still need them?
Also is that 280k active jobs or is that mostly old failed jobs?
All active
If all those are active what is your average job runtime?
Less than a second, but enough to want to put them out of the request.
DJ is not the ideal candidate if you are running massive numbers of very fast jobs.
Good to know
Thanks @albus522
SO thread: http://stackoverflow.com/q/40360455/316700
On MySQL at least, this appears to be due to the LIMIT 1 in several UPDATE queries. See this blog post for a bit more explanation.
I dropped my delayed_job table from ~220k to ~9k rows (lots of failures due to an application bug) and reduced MySQL CPU usage from 80% to 15%.
In Postgres, you can significantly speed it up just with 2 trivial steps:
- Use SELECT ... FOR UPDATE SKIP LOCKED instead of just SELECT ... FOR UPDATE (the code is here https://github.com/collectiveidea/delayed_job_active_record/blob/master/lib/delayed/backend/active_record.rb)
- Add this index (also sketched as a Rails migration below):
CREATE INDEX i_delayed_jobs_priority_partial
ON delayed_jobs USING btree (priority, run_at, queue) WHERE failed_at IS NULL;
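If you manage schema through migrations, here is a hedged sketch of the same partial index (class name and migration version are arbitrary; algorithm: :concurrently is optional but avoids locking the table while the index builds):
class AddPartialIndexToDelayedJobs < ActiveRecord::Migration[5.2]
  disable_ddl_transaction! # required when using algorithm: :concurrently

  def change
    # Same partial index as the raw SQL above: only rows that have not failed.
    add_index :delayed_jobs, [:priority, :run_at, :queue],
              name: "i_delayed_jobs_priority_partial",
              where: "failed_at IS NULL",
              algorithm: :concurrently
  end
end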
SKIP LOCKED was introduced in Postgres 9.5, so that feature requires a relatively new Postgres install and would require us to make our code even more complicated with db version detection so things don't break for some users.
Indices are probably something we need to revisit, but results have not been consistent for any suggested index update so far. They have generally worked for a user's specific case but not for all, and this index is very specific to Postgres.
@albus522 thank you for your response, but it was not a change request; it was advice for those who use DJ with Postgres (I hope with 9.6 or at least 9.5, otherwise one more piece of advice -- upgrade!).
Without those two improvements, you will have big performance issues if you need to process millions of jobs per day.
I considered creating a pull request but changed my mind when I saw how many PRs have been left unprocessed for years in this repo.
Indexing is definitely the way to go. Here's what ours looks like:
add_index "delayed_jobs", ["deleted_at", "locked_at", "failed_at", "queue", "run_at", "locked_by"], name: "deleted_at_locked_at_failed_at_queue_run_at_locked_by", using: :btree
With about 200,000 rows, our queries went from about 500ms to 0.7ms, a 714x improvement.
@joshuapinter that's a huge composite index, and I doubt that most of the fields there are really adding any speed to your queries.
It's been quite a while since I moved away from delayed_job, but try having an index on just the most selective column (or maybe two if one is not good enough) in the query and that should give you similar performance gains. Once you are down from, say, a million rows to 100 (using a selective column index), the extra columns in the index don't really add much value. They just make your writes more costly.
Looking at the queries above (unless they have changed in newer versions), multiple single-column indexes probably make more sense for this than one huge multi-column composite index.
@amitsaxena You're right. I'm gonna run some quick tests there and see where I can stop and still achieve the same performance improvements.
@amitsaxena Ha! You're right. I got the same performance improvement just by using a single deleted_at column in the index, which makes sense because that filters things down to single-digit records instead of 200,000+.
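For concreteness, a hedged sketch of that slimmed-down index (deleted_at is an app-specific soft-delete column, not part of stock delayed_job; the index name is arbitrary):
add_index "delayed_jobs", ["deleted_at"], name: "index_delayed_jobs_on_deleted_at", using: :btree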
Thanks for the tip and reminder!
Glad that I could be of help! :)
@albus522 isn't it time to reconsider it?
SKIP LOCKED was introduced in postgres 9.5, so that feature requires a relatively new postgres install
In less than a year, 9.5 will become the oldest community-supported version of Postgres; see https://www.postgresql.org/support/versioning/. Even now, most of the versions supported by PGDG have the SKIP LOCKED feature. And MySQL also has it since version 8.
Indices are probably something we need to revisit, but results have not been consistent for any suggested index update so far.
Have you checked the index I proposed in https://github.com/collectiveidea/delayed_job/issues/581#issuecomment-289920478? I believe it is helpful in the majority of cases, if not all of them.
For MySQL 5.6, adding an index on the failed_at column helped considerably.
Many select queries (120 reads) on a table with under 2k entries resulted in everything coming to a crawling halt. Adding an index on delayed_jobs(priority, run_at) improved the situation.
I had another index on just the priority column. I had to drop that index because it was preferred by MySQL 5.7.
I added an index on delayed_jobs(priority, run_at) and switched to the default_sql strategy, which is pretty stable now with about 1M records.
Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql
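A hedged migration sketch of that index change (class name and migration version are arbitrary; skip the add_index if delayed_job's generated delayed_jobs_priority index on the same columns is still in place):
class ReplacePriorityIndexOnDelayedJobs < ActiveRecord::Migration[4.2]
  def change
    # Drop the single-column index MySQL 5.7 kept preferring ...
    remove_index :delayed_jobs, column: :priority
    # ... and query through a composite (priority, run_at) index instead.
    add_index :delayed_jobs, [:priority, :run_at], name: "index_delayed_jobs_on_priority_and_run_at"
  end
end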
I added an index on delayed_jobs(priority, run_at) and switched to the default_sql strategy, which is pretty stable now with about 1M records. Delayed::Backend::ActiveRecord.configuration.reserve_sql_strategy = :default_sql
@jairovm With PostgreSQL or MySQL?
It was MySQL, but I ended up building a hybrid between DelayedJob and Sidekiq: I wrote a small rake task that grabs jobs from the delayed_jobs table in batches of 1,000 and enqueues their ids into Redis; Sidekiq workers then pick them up and process the jobs using Delayed::Worker. Something like this:
namespace :jobs do
  desc 'Enqueue Delayed::Job records'
  task enqueuer: :environment do
    Rails.logger.info "Starting job:enqueuer task"
    loop do
      if Sidekiq::ProcessSet.new.size.positive?
        # Grab due jobs (run_at in the past) and hand their ids to Sidekiq.
        Delayed::Job.order(:run_at).where(run_at: ..Time.current).limit(1_000).pluck(:id).each do |id|
          Rails.logger.info "Enqueueing DelayedJob::Job: #{id}"
          DelayedJob::EnqueuerJob.perform_later(job_id: id)
        end
      else
        Rails.logger.info "No workers available, skipping delayed_job jobs for now."
      end
      sleep 10
    end
  end
end
class DelayedJob::EnqueuerJob < ActiveJob::Base
  sidekiq_options retry: 5
  queue_as :critical

  def perform(job_id:, **)
    Delayed::Job.transaction do
      # Lock the row so concurrent Sidekiq workers don't run the same job twice.
      return unless (job = Delayed::Job.lock.find_by(id: job_id))

      Delayed::Worker.new.run(job)
    end
  end
end
With that, you can run the rake jobs:enqueuer task instead of rake jobs:work. We have about 1.5M records in our delayed_jobs table, and it's working pretty well 👌🏽 (still a beta version of this, though).
I hope it helps, @toao 👍🏽
@jairovm thanks for the feedback!
Meanwhile I developed a similar batch solution, but within delayed_job itself. It helped for a short interim period, but DJ has now been fully dropped and replaced with an AMQP-based approach. It works much more smoothly, with no polling, and scales nicely.
In Postgres, you can significantly speed it up just with 2 trivial steps:
- Use SELECT ... FOR UPDATE SKIP LOCKED instead of just SELECT ... FOR UPDATE (the code is here https://github.com/collectiveidea/delayed_job_active_record/blob/master/lib/delayed/backend/active_record.rb)
- Add this index:
CREATE INDEX i_delayed_jobs_priority_partial ON delayed_jobs USING btree (priority, run_at, queue) WHERE failed_at IS NULL;
I've added the indices to the delayed_jobs table and I'm already seeing significant improvement! Does anyone know how to make the modifications for the first step mentioned, using SELECT ... FOR UPDATE SKIP LOCKED? I assume it should be done in the delayed_job initializer file in Rails.
Ran into a similar issue recently where large row counts in the delayed_jobs table caused high row lock contention on the db and job processing ground to a halt. We added the following monkey patch in an initializer file to use the SKIP LOCKED PostgreSQL query functionality, and voila, our db lock contention vanished 💥 (note we do have the index defined as well).
# typed: false
# frozen_string_literal: true

# https://github.com/collectiveidea/delayed_job/issues/581#issuecomment-289920478
# https://www.postgresql.org/docs/16/sql-select.html ('The Locking Clause' section)
module Delayed
  module Backend
    module ActiveRecord
      class Job < ::ActiveRecord::Base
        # Overrides the reservation method delayed_job_active_record uses for the
        # PostgreSQL adapter, adding SKIP LOCKED so workers don't block each other
        # while competing for the next job.
        def self.reserve_with_scope_using_optimized_postgres(ready_scope, worker, now)
          quoted_name = connection.quote_table_name(table_name)
          subquery = ready_scope.limit(1).lock("FOR UPDATE SKIP LOCKED").select("id").to_sql
          sql = "UPDATE #{quoted_name} SET locked_at = ?, locked_by = ? WHERE id IN (#{subquery}) RETURNING *"
          reserved = find_by_sql([sql, now, worker.name])
          reserved[0]
        end
      end
    end
  end
end
Shout out to @NikolayS for providing the fix that worked for us.
Also, other Postgres-backed queue systems implement the SKIP LOCKED approach:
- https://github.com/rails/solid_queue/blob/664931da38548e5a3845952fe24b24f5285950ee/app/models/solid_queue/record.rb#L11
- https://github.com/Betterment/delayed/blob/24b75f56c43e470ecdec2d3afdcc11bb21a6e2e3/app/models/delayed/job.rb#L88
And a good read on why using this feature is a good idea: https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/
@camallen / @NikolayS if we could get this SKIP LOCKED reservation strategy plugged into both the PostgreSQL and MySQL adapters as an alternate reservation strategy, that could make delayed_job a contender again in the Rails async job ecosystem!