Mysql2::Error: Deadlock when attempting to lock a job
We've just upgraded from 0.3.3 to 0.4.4, to resolve the race condition problem when running multiple workers. We're now seeing occasional MySQL deadlocks, which are unhandled exceptions and end up killing the worker.
For example:
svc1 Jun 5 12:35:17 Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `delayed_jobs` SET `locked_at` = '2013-06-05 12:35:16', `locked_by` = 'delayed_job.2 host:svc1 pid:20498' WHERE `delayed_jobs`.`queue` IN ('batch') AND ((run_at <= '2013-06-05 12:35:16' AND (locked_at IS NULL OR locked_at < '2013-06-05 08:35:16') OR locked_by = 'delayed_job.2 host:svc1 pid:20498') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1
It would seem that, at the very least, this should be handled.
In case it matters, we're using delayed_job 3.0.4.
I get this also with: 4.0.0.beta2
+1 ..
Seems to happen to us when creating a delayed job from within another running job. Not all the time, of course; just very intermittently.
+1 delayed_job (3.0.5) delayed_job_active_record (0.4.4)
Same here. delayed_job (3.0.5) delayed_job_active_record (0.4.4)
+1
delayed_job (3.0.5)
    activesupport (~> 3.0)
delayed_job_active_record (0.4.4)
    activerecord (>= 2.1.0, < 4)
    delayed_job (~> 3.0)
mysql (5.5.27)
Can we help with anything to solve this? Does anyone have a patch or hints? We are restarting the workers every 10 minutes because of this. Thanks and regards, Ph.
@philister We ended up forking and changing the logic to use a row level lock. I suspect the original implementation went to lengths to avoid using a lock, but in the end I'll accept a performance penalty for more reliability. You can find our fork here: https://github.com/doxo/delayed_job_active_record
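For anyone curious what a row-level-lock approach looks like, the general shape (a sketch only, not necessarily identical to the fork's actual queries) is to select a candidate row inside a transaction with FOR UPDATE and then claim it:

```sql
-- Sketch of row-level-lock job reservation; the fork's implementation
-- may differ. FOR UPDATE serializes workers on the candidate row, which
-- trades some throughput for not racing on the same job.
BEGIN;
SELECT id FROM delayed_jobs
 WHERE run_at <= NOW() AND locked_at IS NULL AND failed_at IS NULL
 ORDER BY priority ASC, run_at ASC
 LIMIT 1
 FOR UPDATE;
UPDATE delayed_jobs
   SET locked_at = NOW(), locked_by = 'delayed_job host:... pid:...'
 WHERE id = /* id returned by the SELECT above */;
COMMIT;
```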
Note that we do still get occasional deadlocks which result in restarts, but rare in comparison to what we were getting before. It would be even better if the gem could detect the deadlock and retry, but getting the root cause out of the exception raised by the MySQL adapter is not straightforward, and possibly version dependent. So I opted to live with the occasional deadlocks for now.
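Detecting the deadlock from the raised exception is indeed messy, but a pragmatic (if fragile) workaround is to match on the error message text. A minimal sketch, assuming the MySQL adapter surfaces the server's "Deadlock found" text in the exception message; the `with_deadlock_retry` helper, the message regexp, and the retry limit are all hypothetical, not part of the gem:

```ruby
# Retry a block a bounded number of times when MySQL reports a deadlock.
# Matching on message text is fragile and may be adapter/version
# dependent, as noted above -- treat this as a sketch, not gem code.
DEADLOCK_MESSAGE = /Deadlock found when trying to get lock/.freeze

def with_deadlock_retry(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    raise unless e.message =~ DEADLOCK_MESSAGE && attempts < max_attempts
    sleep(0.05 * attempts) # brief backoff before restarting the transaction
    retry
  end
end
```

One would wrap the job-reservation call in this helper so a killed UPDATE is retried instead of crashing the worker.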
@gorism Thanks a lot!
@gorism great fix! Thanks! -- The original dj's workers were dying after deadlock. I'm happy to see my workers alive after some time.
Edit: see the followup at https://github.com/collectiveidea/delayed_job_active_record/issues/63#issuecomment-25468573; the new queries may not be a net gain, even without deadlocks.
We were bit by this as well during an upgrade.
I set up a concurrency test on my dev machine using two workers and a self-replicating job. The job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably repros the deadlock within seconds.
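For anyone wanting to set up the same repro, the self-replicating job can be simulated outside delayed_job entirely. A minimal sketch, with a plain array standing in for the `delayed_jobs` table; the `ReplicatingJob` class and its fields are hypothetical, not from the gem:

```ruby
# A job that enqueues two copies of itself with semi-random priority and
# run_at values, mirroring the repro described above. Two or more workers
# draining such a queue keep reserving and deleting rows indefinitely,
# which is what surfaces the deadlock.
class ReplicatingJob
  attr_reader :priority, :run_at

  def initialize(priority:, run_at:)
    @priority = priority
    @run_at = run_at
  end

  # Performing one job pushes two more onto the queue.
  def perform(queue, now: Time.now)
    2.times do
      queue << ReplicatingJob.new(
        priority: rand(0..2),
        run_at: now + rand(0..5) # up to 5 seconds into the future
      )
    end
  end
end
```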
The output of show engine innodb status says the contention is between the UPDATE query now used to reserve a job, and the DELETE query used to clean up a finished job. Surprising! Apparently the DELETE query first acquires a lock on the primary index (id) and then on the secondary index (priority, run_at). But the UPDATE query is using the (priority, run_at) index to scan the table, and is trying to grab primary key locks as it goes. Eventually the UPDATE and DELETE queries each grab one of two locks for a given row, try to acquire the other, and :boom:. MySQL resolves by killing the UPDATE, which crashes the worker.
The fix I've worked out locally is to replace the index on (priority, run_at) with an index on (priority, run_at, locked_by). This completely stabilizes my concurrency test! My theory is that it allows the UPDATE query's scan to skip over rows held by workers, which takes it out of contention with the DELETE query.
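For reference, the index change can be expressed directly in SQL. A sketch, assuming the index is named `delayed_jobs_priority` as in the default delayed_job schema (adjust if yours differs):

```sql
-- Replace the two-column index with one that also covers locked_by,
-- so the reservation UPDATE's scan can skip rows already held by workers.
ALTER TABLE delayed_jobs DROP INDEX delayed_jobs_priority;
ALTER TABLE delayed_jobs ADD INDEX delayed_jobs_priority (priority, run_at, locked_by);
```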
Hope this helps.
:lock::lock:
+20
Thanks a bunch cainlevy. This did the trick. Great work.
Finally this did it for us, too. Thanks @cainlevy. (Other solutions didn't work in our case.)
Upgraded from 0.3.3 to 0.4.4 and we're experiencing deadlock issues as well. We're running a total of 10 workers and usually they manage to keep the job queue down to 1000 or so. On mysql 5.1. @cainlevy's solution didn't work for us :-(.
------------------------
LATEST DETECTED DEADLOCK
------------------------
130811 3:19:18
*** (1) TRANSACTION:
TRANSACTION 2607F0BE7C, ACTIVE 0 sec, process no 7937, OS thread id 1364158784 fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 5 lock struct(s), heap size 1248, 21 row lock(s)
MySQL thread id 2166505, query id 64381984645 init
UPDATE `delayed_jobs` SET `locked_at` = '2013-08-11 03:19:18', `locked_by` = 'delayed_job host:app3 pid:26732' WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0BE7C lock_mode X locks rec but not gap waiting
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
0: len 4; hex a1dd92ed; asc ;;
1: len 6; hex 002607f0c031; asc & 1;;
2: len 7; hex 000000549014ba; asc T ;;
3: len 4; hex 80000000; asc ;;
4: len 4; hex 80000000; asc ;;
5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
6: SQL NULL;
7: len 8; hex 8000124f11d7316a; asc O 1j;;
8: len 8; hex 8000124f11d7316e; asc O 1n;;
9: SQL NULL;
10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
11: SQL NULL;
12: len 8; hex 8000124f11d7316a; asc O 1j;;
13: len 8; hex 8000124f11d7316a; asc O 1j;;
*** (2) TRANSACTION:
TRANSACTION 2607F0C031, ACTIVE 0 sec, process no 7937, OS thread id 1363892544 updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1248, 2 row lock(s), undo log entries 1
MySQL thread id 2166501, query id 64381985155 updating
DELETE FROM `delayed_jobs` WHERE `delayed_jobs`.`id` = 568169197
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 1853 page no 89833 n bits 80 index `PRIMARY` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap
Record lock, heap no 10 PHYSICAL RECORD: n_fields 14; compact format; info bits 32
0: len 4; hex a1dd92ed; asc ;;
1: len 6; hex 002607f0c031; asc & 1;;
2: len 7; hex 000000549014ba; asc T ;;
3: len 4; hex 80000000; asc ;;
4: len 4; hex 80000000; asc ;;
5: len 30; hex 2d2d2d2021727562792f6f626a6563743a44656c617965643a3a50657266; asc --- !ruby/object:Delayed::Perf; (total 1016 bytes);
6: SQL NULL;
7: len 8; hex 8000124f11d7316a; asc O 1j;;
8: len 8; hex 8000124f11d7316e; asc O 1n;;
9: SQL NULL;
10: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
11: SQL NULL;
12: len 8; hex 8000124f11d7316a; asc O 1j;;
13: len 8; hex 8000124f11d7316a; asc O 1j;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 1853 page no 33 n bits 648 index `delayed_jobs_priority` of table `production`.`delayed_jobs` trx id 2607F0C031 lock_mode X locks rec but not gap waiting
Record lock, heap no 406 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
0: len 4; hex 80000000; asc ;;
1: len 8; hex 8000124f11d7316a; asc O 1j;;
2: len 30; hex 64656c617965645f6a6f6220686f73743a61707030207069643a32333832; asc delayed_job host:app0 pid:2382; (total 31 bytes);
3: len 4; hex a1dd92ed; asc ;;
*** WE ROLL BACK TRANSACTION (2)
mysql> show indexes from delayed_jobs;
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| delayed_jobs | 0 | PRIMARY | 1 | id | A | 2873 | NULL | NULL | | BTREE | |
| delayed_jobs | 1 | delayed_jobs_locking | 1 | locked_at | A | 1 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_locking | 2 | locked_by | A | 1 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_priority | 1 | priority | A | 1 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_priority | 2 | run_at | A | 1436 | NULL | NULL | YES | BTREE | |
| delayed_jobs | 1 | delayed_jobs_priority | 3 | locked_by | A | 1436 | NULL | NULL | YES | BTREE | |
+--------------+------------+-----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
6 rows in set (0.00 sec)
What do you get from an explain on a query with the same WHERE and ORDER clauses?
EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1
mysql> EXPLAIN SELECT * FROM delayed_jobs WHERE ((run_at <= '2013-08-11 03:19:18' AND (locked_at IS NULL OR locked_at < '2013-08-10 23:19:18') OR locked_by = 'delayed_job host:app3 pid:26732') AND failed_at IS NULL) AND (priority <= '0') ORDER BY priority ASC, run_at ASC LIMIT 1;
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
| 1 | SIMPLE | delayed_jobs | range | delayed_jobs_locking,delayed_jobs_priority | delayed_jobs_priority | 5 | NULL | 589 | Using where |
+----+-------------+--------------+-------+--------------------------------------------+-----------------------+---------+------+------+-------------+
1 row in set (0.00 sec)
I don't think we have the delayed_jobs_locking index on our table, but I guess it's not being used anyway? The only other thing I can think of is that maybe MySQL 5.1's query planner isn't taking advantage of the updated delayed_jobs_priority index.
Hopefully you can reproduce the deadlock safely with a handful of workers running a self-replicating job. Let us know if you find anything more!
I added the delayed_jobs_locking index for pulling other metrics. It would be nice if DJ-AR allowed us to specify the strategy for claiming jobs, because I would use the old strategy and still get the new upgraded features.
Adding locked_by to the index (as described by @cainlevy) has the side effect for me that only a few (2-5) of our 10 workers are working at the same time. Maybe this affects the performance of the jobs table?
Thanks @cainlevy. Adding locked_by to the index seems to have solved the problem for me as well. I did it yesterday, amidst the chaos of these deadlock errors being thrown all over the place in a queue of 100k jobs. As soon as the change was made, the deadlock issues were gone.
+1 Thanks @cainlevy, works great.
Is everyone this is working for on MySQL 5.5+?
@ngan @cainlevy's fix doesn't help me on 5.5
@ngan - I am on MySQL 5.5.31 and @cainlevy's fix DOES help me.
MySQL 5.5.x
If my fix isn't working for someone, I'd recommend following the steps I described earlier to set up a repro someplace safe to experiment. It's the only way we're going to learn more.
I'll publish my test setup in a moment. In the meantime: I could not repro on 5.6.13.
My experience with this patch: it does stop the deadlocks, but my DJ throughput doesn't improve once the deadlocks are gone.
However, removing the MySQL-specific optimizations makes performance jump from about 1 job/second/worker to about 10 jobs/second/worker. I'm seeing whether upgrading to 5.6 will help one way or the other.
I had originally thought some of the fixes above solved my deadlock issues, but apparently I still get them. I think using multiple processes will cause race conditions with the DB, so I built a gem that is thread-based: a manager thread pulls jobs from the DB and distributes them to a thread pool of workers. I would totally welcome contributors to http://github.com/zxiest/delayed_job_active_record_threaded
@zxiest That sounds like a good way to make DB polling more efficient, but if you're still having trouble with deadlocks, it'll probably come back when you run multiple machines.
@michaelglass I presume you're measuring performance against a previous version of the gem? Are you measuring real-world throughput, or using an empty job?
@cainlevy Yup, this might be an issue. However, if needed in the future, I can achieve mutual exclusivity by processing different queues on different servers at first, and if the queues get too large, I could partition by job_id % servers_number == server_id. I would still use protective locks for safety but avoid locking as much as possible =)
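The modulo partitioning idea above, sketched in plain Ruby. It assumes each server knows its own index and the total server count; the method and variable names are hypothetical:

```ruby
# Partition jobs across servers by id so no two servers ever compete
# for the same row: a server only claims jobs where
# job_id % servers_number == server_id.
def jobs_for_server(job_ids, server_id, servers_number)
  job_ids.select { |id| id % servers_number == server_id }
end
```

The partitions are disjoint by construction, so workers on different servers never contend for the same row, at the cost of requiring stable server numbering.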