
Lock wait timeout exceeded MySQL errors

Open nickdunn opened this issue 4 years ago • 27 comments

Description

Very sporadically I am getting Lock wait timeout exceeded errors from MySQL.

These appear in Craft logs like:

Next yii\db\Exception: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction

and

[error][yii\db\Exception] PDOException: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction in /var/www/craft/vendor/yiisoft/yii2/db/Command.php:1302

Running SHOW ENGINE INNODB STATUS shows they are consistently from queries on the structureelements table:

------------------------
LATEST DETECTED DEADLOCK
------------------------
2021-09-30 13:49:41 0x14b8
*** (1) TRANSACTION:
TRANSACTION 6366085, ACTIVE 2 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 22 lock struct(s), heap size 1136, 10 row lock(s), undo log entries 8
MySQL thread id 89160, OS thread handle 5260, query id 7402009 10.2.0.62 cswwwuatdbmysql updating
UPDATE `structureelements` SET `lft`=`lft`+2, `dateUpdated`='2021-09-30 13:49:39' WHERE (`lft` >= 884) AND (`root`=1)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 623 page no 20 n bits 1192 index structureelements_root_idx of table `craftcms`.`structureelements` trx id 6366085 lock_mode X waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 4; hex 00000001; asc     ;;
 1: len 4; hex 80000001; asc     ;;

*** (2) TRANSACTION:
TRANSACTION 6365973, ACTIVE 5 sec inserting
mysql tables in use 1, locked 1
335 lock struct(s), heap size 41168, 2975 row lock(s), undo log entries 2044
MySQL thread id 89190, OS thread handle 5304, query id 7403525 10.2.0.62 cswwwuatdbmysql update
INSERT INTO `structureelements` (`structureId`, `dateCreated`, `dateUpdated`, `uid`, `lft`, `rgt`, `level`) VALUES (20875, '2021-09-30 13:49:41', '2021-09-30 13:49:41', '58e60ef1-6f8b-4a5d-9783-cf378111c29d', 1, 2, 0)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 623 page no 20 n bits 1192 index structureelements_root_idx of table `craftcms`.`structureelements` trx id 6365973 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 4; hex 00000001; asc     ;;
 1: len 4; hex 80000001; asc     ;;

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 623 page no 20 n bits 1192 index structureelements_root_idx of table `craftcms`.`structureelements` trx id 6365973 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 4; hex 00000001; asc     ;;
 1: len 4; hex 80000001; asc     ;;

*** WE ROLL BACK TRANSACTION (1)

The database server is pretty chunky and performs well elsewhere. The innodb_lock_wait_timeout setting was at the 50-second default. I have just bumped it to 100, but this feels like a sticking plaster, i.e. telling it to wait a bit longer just in case!
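For reference, checking and changing the timeout is plain MySQL (SET GLOBAL needs admin privileges and only affects connections opened after the change):

-- current value, in seconds
SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';

-- raise it server-wide (applies to new connections)
SET GLOBAL innodb_lock_wait_timeout = 100;

-- or raise it only for the current session
SET SESSION innodb_lock_wait_timeout = 100;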

https://github.com/craftcms/cms/issues/5872 suggests a large number of revisions could be the issue. I only have about 6k entries and 4k revisions, while there are 57k rows in structureelements.

I'm conscious that StackOverflow is awash with useless threads on these errors, and the usual reply is "refactor your code to make sure everything is in one transaction", which doesn't seem too useful.

The offending queries seem fairly innocuous to me. The issue is very sporadic — maybe 10 times in 3 months — and always seemingly when a content editor goes to save an entry.
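For context, the statements in the trace look like the standard nested-set bookkeeping Craft does on structureelements. A rough sketch of what a single insert involves (values lifted from the trace above, purely illustrative):

-- Making room for a new node in root 1 shifts every row to its right, so each
-- UPDATE locks a range of index records on structureelements_root_idx:
UPDATE structureelements SET rgt = rgt + 2, dateUpdated = NOW() WHERE rgt >= 884 AND root = 1;
UPDATE structureelements SET lft = lft + 2, dateUpdated = NOW() WHERE lft >= 884 AND root = 1;

-- The new node is then inserted into the gap that was just opened up:
INSERT INTO structureelements (structureId, dateCreated, dateUpdated, uid, lft, rgt, level)
VALUES (20875, NOW(), NOW(), UUID(), 884, 885, 1);

-- A concurrent save or revision-prune on the same root needs overlapping
-- range/gap locks on that index, which presumably is how two otherwise
-- innocuous-looking transactions end up deadlocked.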

Reckons?

Additional info

  • Craft version: 3.7.13
  • PHP version: 8.0.1
  • Database driver & version: MySQL 5.7.32 (https://azure.microsoft.com/en-us/services/mysql/)

nickdunn avatar Sep 30 '21 18:09 nickdunn

Hey @nickdunn, do you have SEOmatic installed by chance? If so, can you please ensure that you are running its latest version? We have seen similar issues where that ended up being the culprit, which was fixed in v3.4.14.

brandonkelly avatar Oct 01 '21 21:10 brandonkelly

@brandonkelly I've been experiencing the same on a page recently. It's a "Pruning extra revisions" job that fails (see attached screenshot).

Now we are running the latest version, but we noticed the issue 10 days ago, after running 3.7.11 for 2 weeks. Before that we had been running 3.6.17 without any problems, so it might be a change in 3.7 causing this?

Another thing I have noticed is that this often happens after saving a specific entry. The entry contains a Matrix field with a bunch of blocks, but this entry has had its revisions pruned without a problem before.

System info

  • PHP 7.4.15
  • Linux 5.4.0-88-generic
  • MySQL 8.0.23
  • Craft Pro 3.7.14

Plugins

  • AsyncQueue 2.3.0
  • Calendar Links 1.0.1
  • Craft MailChimp 1.0.2 (custom plugin)
  • DigitalOcean Spaces Volume 1.1.3
  • Element API 2.8.3
  • Imgix 2.1.0
  • Minify 1.2.10
  • Redactor 2.8.8
  • Super Table 2.6.8
  • Twigpack 1.2.15

Notes

It's an issue that occurred before AsyncQueue was installed.

Timeline

  • Sun Jul 04 2021: Updated to 3.6.17
  • Tue Sep 07 2021: Updated to 3.7.11
  • Fri Sep 24 2021: First time we noticed the problem. Installed the Async Queue plugin as a potential solution.
  • Sat Oct 02 2021: Experienced the same problem. Updated to 3.7.14. Managed to recreate the problem by saving the entry after the update.

Hope this helps. Let me know if you need me to test anything more.

Edit: It's not hard to notice the problem, as the server stalls when it happens.

lassemt avatar Oct 02 '21 12:10 lassemt

@lassemt Does the job finish successfully if you restart it? Or does it consistently fail?

brandonkelly avatar Oct 04 '21 22:10 brandonkelly

@lassemt Does the job finish successfully if you restart it? Or does it consistently fail?

@brandonkelly I didn't try restarting the queue, as most of the time I had to restart the server. I have the queue.log from when it happened, and you can see it tries to restart the job itself:

2021-10-01 17:14:33 [-][150469][-][info][craft\queue\QueueLogBehavior::beforeExec]  [272119] Pruning extra revisions (attempt: 1, pid: 461734) - Started
2021-10-01 17:15:25 [-][150469][-][error][craft\queue\QueueLogBehavior::afterError]  [272119] Pruning extra revisions (attempt: 1, pid: 461734) - Error (time: 52.052s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 1059) AND (`rgt` <= 1060)) AND (`root`=1)
2021-10-01 17:15:25 [-][150469][-][info][craft\queue\QueueLogBehavior::beforeExec]  [277820] Pruning extra revisions (attempt: 1, pid: 461734) - Started
2021-10-01 17:17:58 [-][150469][-][error][craft\queue\QueueLogBehavior::afterError]  [277820] Pruning extra revisions (attempt: 1, pid: 461734) - Error (time: 152.207s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 955) AND (`rgt` <= 956)) AND (`root`=1)
2021-10-01 17:17:58 [-][150469][-][info][craft\queue\QueueLogBehavior::beforeExec]  [283346] Pruning extra revisions (attempt: 1, pid: 461734) - Started
2021-10-01 17:22:06 [-][150469][-][error][craft\queue\QueueLogBehavior::afterError]  [283346] Pruning extra revisions (attempt: 1, pid: 461734) - Error (time: 248.894s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 955) AND (`rgt` <= 956)) AND (`root`=1)
2021-10-01 17:22:06 [-][150469][-][info][craft\queue\QueueLogBehavior::beforeExec]  [270997] Pruning extra revisions (attempt: 2, pid: 461734) - Started
2021-10-01 17:27:08 [-][150469][-][error][craft\queue\QueueLogBehavior::afterError]  [270997] Pruning extra revisions (attempt: 2, pid: 461734) - Error (time: 301.233s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 955) AND (`rgt` <= 956)) AND (`root`=1)

lassemt avatar Oct 05 '21 10:10 lassemt

Thanks. The first and last errors are 12 minutes apart, so that probably rules out the possibility that the lock was just legitimately tied up by a separate process.

Do you see any other [errors] further up in the logs that look like they might have had something to do with updating structure data? Or do you have anything with a semi-recent timestamp in storage/logs/phperrors.log? Seems like maybe there was a previous request that ended prematurely, and wasn’t able to release its lock on the structure.

brandonkelly avatar Oct 05 '21 17:10 brandonkelly

@brandonkelly that makes sense. Here is a batch from queue.log from when I tried to force the error by manually saving the entry a couple of times. This was done right after I restarted the server, so the first event you see is the first queued job after the restart, and the last event is from right after I triggered the server restart through Digital Ocean.

2021-10-02 13:42:20 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292891] Updating search indexes (attempt: 1, pid: 1628) - Started
2021-10-02 13:42:20 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292891] Updating search indexes (attempt: 1, pid: 1628) - Done (time: 0.056s)
2021-10-02 13:42:20 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292894] Updating search indexes (attempt: 1, pid: 1628) - Started
2021-10-02 13:42:20 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292894] Updating search indexes (attempt: 1, pid: 1628) - Done (time: 0.050s)
2021-10-02 13:42:21 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292893] Pruning extra revisions (attempt: 1, pid: 1628) - Started
2021-10-02 13:42:22 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292893] Pruning extra revisions (attempt: 1, pid: 1628) - Done (time: 1.023s)
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292896] Updating element slugs and URIs (attempt: 1, pid: 1665) - Started
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292896] Updating element slugs and URIs (attempt: 1, pid: 1665) - Done (time: 0.039s)
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292895] Updating search indexes (attempt: 1, pid: 1665) - Started
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292895] Updating search indexes (attempt: 1, pid: 1665) - Done (time: 0.052s)
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292898] Updating search indexes (attempt: 1, pid: 1665) - Started
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292898] Updating search indexes (attempt: 1, pid: 1665) - Done (time: 0.048s)
2021-10-02 13:42:48 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292897] Pruning extra revisions (attempt: 1, pid: 1665) - Started
2021-10-02 13:42:49 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292897] Pruning extra revisions (attempt: 1, pid: 1665) - Done (time: 0.788s)
2021-10-02 13:43:06 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292900] Updating element slugs and URIs (attempt: 1, pid: 1727) - Started
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292900] Updating element slugs and URIs (attempt: 1, pid: 1727) - Done (time: 0.037s)
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292899] Updating search indexes (attempt: 1, pid: 1727) - Started
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292899] Updating search indexes (attempt: 1, pid: 1727) - Done (time: 0.046s)
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292902] Updating search indexes (attempt: 1, pid: 1727) - Started
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292902] Updating search indexes (attempt: 1, pid: 1727) - Done (time: 0.049s)
2021-10-02 13:43:07 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292901] Pruning extra revisions (attempt: 1, pid: 1727) - Started
2021-10-02 13:43:08 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292901] Pruning extra revisions (attempt: 1, pid: 1727) - Done (time: 0.654s)
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292904] Updating element slugs and URIs (attempt: 1, pid: 1755) - Started
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292904] Updating element slugs and URIs (attempt: 1, pid: 1755) - Done (time: 0.036s)
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292903] Updating search indexes (attempt: 1, pid: 1755) - Started
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292903] Updating search indexes (attempt: 1, pid: 1755) - Done (time: 0.048s)
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292906] Updating search indexes (attempt: 1, pid: 1755) - Started
2021-10-02 13:43:24 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292906] Updating search indexes (attempt: 1, pid: 1755) - Done (time: 0.048s)
2021-10-02 13:43:25 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292905] Pruning extra revisions (attempt: 1, pid: 1755) - Started
2021-10-02 13:43:25 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292905] Pruning extra revisions (attempt: 1, pid: 1755) - Done (time: 0.749s)
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292908] Updating element slugs and URIs (attempt: 1, pid: 1784) - Started
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292908] Updating element slugs and URIs (attempt: 1, pid: 1784) - Done (time: 0.040s)
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292907] Updating search indexes (attempt: 1, pid: 1784) - Started
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292907] Updating search indexes (attempt: 1, pid: 1784) - Done (time: 0.049s)
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292910] Updating search indexes (attempt: 1, pid: 1784) - Started
2021-10-02 13:43:43 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292910] Updating search indexes (attempt: 1, pid: 1784) - Done (time: 0.050s)
2021-10-02 13:43:44 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292909] Pruning extra revisions (attempt: 1, pid: 1784) - Started
2021-10-02 13:43:44 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292909] Pruning extra revisions (attempt: 1, pid: 1784) - Done (time: 0.780s)
2021-10-02 13:44:29 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292912] Updating element slugs and URIs (attempt: 1, pid: 1894) - Started
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292912] Updating element slugs and URIs (attempt: 1, pid: 1894) - Done (time: 0.440s)
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292916] Updating element slugs and URIs (attempt: 1, pid: 1897) - Started
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292916] Updating element slugs and URIs (attempt: 1, pid: 1897) - Done (time: 0.091s)
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292911] Updating search indexes (attempt: 1, pid: 1894) - Started
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292911] Updating search indexes (attempt: 1, pid: 1894) - Done (time: 0.114s)
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292915] Updating search indexes (attempt: 1, pid: 1894) - Started
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292915] Updating search indexes (attempt: 1, pid: 1894) - Done (time: 0.204s)
2021-10-02 13:44:30 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292914] Updating search indexes (attempt: 1, pid: 1897) - Started
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292914] Updating search indexes (attempt: 1, pid: 1897) - Done (time: 0.351s)
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292918] Updating search indexes (attempt: 1, pid: 1894) - Started
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [292918] Updating search indexes (attempt: 1, pid: 1894) - Done (time: 0.105s)
2021-10-02 13:44:31 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292913] Pruning extra revisions (attempt: 1, pid: 1897) - Started
2021-10-02 13:45:24 [-][-][-][error][craft\queue\QueueLogBehavior::afterError]  [292913] Pruning extra revisions (attempt: 1, pid: 1897) - Error (time: 53.096s): SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
The SQL being executed was: DELETE FROM `craft_structureelements` WHERE ((`lft` >= 927) AND (`rgt` <= 928)) AND (`root`=1)
2021-10-02 13:44:32 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [292917] Pruning extra revisions (attempt: 1, pid: 1894) - Started
2021-10-02 13:47:45 [-][-][-][error][craft\queue\QueueLogBehavior::afterError]  [292917] Pruning extra revisions (attempt: 1, pid: 1894) - Error (time: 193.661s): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

The only error in phperrors.log close to this timestamp is at [02-Oct-2021 13:47:45 Europe/Oslo], which is related to the shutdown: "Communication link failure: 1053 Server shutdown in progress".

I was watching the queue in the admin when the above was happening and took screenshots of the 2 "Pruning extra revisions" jobs that "got stuck". The screenshots are attached in the order the jobs appeared in the queue overview in the admin.

I'm not sure if it matters, but notice that the 2 jobs relate to the same SourceID, and that the one with ID 292913, which was pushed first, failed right after the one with ID 292917 was pushed to the queue. I believe the server started to run slow right after the second one was pushed.

lassemt avatar Oct 05 '21 18:10 lassemt

@lassemt Can you try updating to Craft 3.7.15, restart the DB server, and see if you can still reproduce this? Sounds like maybe 8e780c723fd78a0054d4f5c15897157e2e580c99 might help.

brandonkelly avatar Oct 05 '21 20:10 brandonkelly

@brandonkelly I updated and restarted the server now. The same issue is still there and the logs are the same. But I noticed that it only happens when there are 2 "Pruning extra revisions" jobs for the same entry. Neither job completes before the server starts to slow down, and it is always the first one pushed to the queue that gets the Lock wait timeout exceeded error message. The second of the 2 jobs rarely shows up in queue.log.

(Screenshot attached.)

If there is only one "Pruning extra revisions" job for the entry, the job finishes as expected:

2021-10-06 00:46:36 [-][-][-][info][craft\queue\QueueLogBehavior::beforeExec]  [294411] Pruning extra revisions (attempt: 1, pid: 222642) - Started
2021-10-06 00:46:37 [-][-][-][info][craft\queue\QueueLogBehavior::afterExec]  [294411] Pruning extra revisions (attempt: 1, pid: 222642) - Done (time: 0.915s)

I had to open the entry editor in 2 tabs and save them almost simultaneously to get 2 jobs going.

lassemt avatar Oct 05 '21 23:10 lassemt

Hey @nickdunn, do you have SEOmatic installed by chance? If so, can you please ensure that you are running its latest version? We have seen similar issues where that ended up being the culprit, which was fixed in v3.4.14.

We do, and we have recently updated to the latest version. We haven't seen the error since... but then again we've only seen the error a handful of times in 6 months, so it's sporadic. I don't think this was the culprit, since we've had regenerateSitemapsAutomatically set to false for a long time, so SEOmatic isn't doing anything queue-related — so I think this is different to https://github.com/nystudio107/craft-seomatic/issues/989.

nickdunn avatar Oct 06 '21 09:10 nickdunn

@lassemt Do you have multiple queue runners?

brandonkelly avatar Oct 06 '21 12:10 brandonkelly

I'd like to chip in with a similar experience, although it's not just a single resource/element that it's happening on. We have a fairly chunky install, with 9 sites, ~50k rows in elements and ~8k in structureelements. maxRevisions is set to 5 in config, and although we're using Redis for caching, it's not being used for the queue; the default driver is being used there. I'd consider offloading the queue to Redis if that would help or is recommended anyway (rough sketch of what I understand that would involve is below). The queue runner in the offending environment is the default (triggered by the CP rather than queue/run in a cron job).
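(For reference, offloading the queue to Redis would mean overriding Craft's queue component in config/app.php, roughly along the lines of the sketch below. This assumes the yiisoft/yii2-redis extension is installed, and as far as I understand a non-default queue driver may not report job progress in the CP the way Craft's own queue does.)

<?php
// config/app.php - rough sketch only, not something we are running yet
return [
    'components' => [
        'redis' => [
            'class' => yii\redis\Connection::class,
            'hostname' => 'localhost',
            'port' => 6379,
            'database' => 0,
        ],
        'queue' => [
            'class' => yii\queue\redis\Queue::class,
            'redis' => 'redis',   // the connection component defined above
            'channel' => 'queue', // Redis key prefix used for queued jobs
        ],
    ],
];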

MariaDB pegged a single core (out of 4) once this behaviour started and had a huge impact on DB performance, such that the CP was unusable (couldn't save entries reliably). Killing the DB process was the only way to resolve things. I have since run prune-revisions via the CLI and it ripped through pretty quickly.

queue.log is attached for info, not that it reports anything that hasn't been seen already. (What's with the order of the entries being messed up, though? Why aren't they in ascending time order?)

Craft 3.6.18

Plugins:
  • bertoost/craft-mailjet: 1.2.3
  • born05/craft-imagehotspots: 1.2.0
  • chasegiunta/scss: 1.0.0
  • craftcms/feed-me: 4.3.6
  • craftcms/redactor: 2.8.8
  • endroid/qr-code: ^3.9
  • ether/simplemap: 3.9.1
  • mmikkel/reasons: 2.2.5
  • nthmedia/entry-gps-coordinates: 1.3.1
  • verbb/field-manager: 2.2.3
  • verbb/formie: 1.4.12
  • verbb/super-table: 2.6.8

queue.log: queue-snip.log

wallacio avatar Oct 07 '21 08:10 wallacio

@lassemt Do you have multiple queue runners?

@brandonkelly I don't think so (I'm not too familiar with queues in PHP). As far as I know, I'm only running the default queue driver. I disabled the ostark/craft-async-queue plugin before updating and running the previous tests.

lassemt avatar Oct 07 '21 09:10 lassemt

Using Async Queue here, default 2 concurrent processes.

nickdunn avatar Oct 07 '21 10:10 nickdunn

@nickdunn I have a feeling multiple queue runners have a tendency to exacerbate these sorts of issues.

brandonkelly avatar Oct 07 '21 14:10 brandonkelly

Yeah, use 1 queue processor and you'll be fine. We learned this the hard way too, hence: https://twitter.com/sjcallender/status/1433238613039407105

sjcallender avatar Oct 11 '21 18:10 sjcallender

Ok, we've upgraded Craft, Neo and SEOmatic to the latest versions and have dropped Async Queue to one queue processor, so we'll see if this fixes it.

nickdunn avatar Oct 12 '21 11:10 nickdunn

Same here!

Our client's entire site went down after a few "Internal server error" messages. As the admin I could test this once, and after that the website + CP went down. The Yii toolbar gave an error: Lock wait timeout exceeded; try restarting transaction

There were 2 failed tasks in the queue before it threw me out, both stating something about Pruning extra revisions:

2021-10-25 12:28:34 [-][1440][-][info][craft\queue\QueueLogBehavior::beforeExec]  [36509] Pruning extra revisions (attempt: 1, pid: 1930471) - Started
2021-10-25 15:09:29 [-][1440][-][error][craft\queue\QueueLogBehavior::afterError]  [36509] Pruning extra revisions (attempt: 1, pid: 1930471) - Error (time: 9655.015s): SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
2021-10-25 15:09:29 [-][1440][-][error][PDOException] PDOException: SQLSTATE[HY000]: General error: 2006 MySQL server has gone away in /vendor/yiisoft/yii2/db/Transaction.php:203
Stack trace:
#0 /vendor/yiisoft/yii2/db/Transaction.php(203): PDO->rollBack()
#1 /vendor/craftcms/cms/src/services/Elements.php(1541): yii\db\Transaction->rollBack()
#2 /vendor/craftcms/cms/src/queue/jobs/PruneRevisions.php(75): craft\services\Elements->deleteElement()
#3 /vendor/yiisoft/yii2-queue/src/Queue.php(246): craft\queue\jobs\PruneRevisions->execute()
#4 /vendor/yiisoft/yii2-queue/src/cli/Queue.php(147): yii\queue\Queue->handleMessage()
#5 /vendor/craftcms/cms/src/queue/Queue.php(131): yii\queue\cli\Queue->handleMessage()
#6 [internal function]: craft\queue\Queue->craft\queue\{closure}()
#7 /vendor/yiisoft/yii2-queue/src/cli/Queue.php(117): call_user_func()
#8 /vendor/craftcms/cms/src/queue/Queue.php(140): yii\queue\cli\Queue->runWorker()
#9 /vendor/craftcms/cms/src/controllers/QueueController.php(84): craft\queue\Queue->run()
#10 [internal function]: craft\controllers\QueueController->actionRun()
#11 /vendor/yiisoft/yii2/base/InlineAction.php(57): call_user_func_array()
#12 /vendor/yiisoft/yii2/base/Controller.php(181): yii\base\InlineAction->runWithParams()
#13 /vendor/craftcms/cms/src/web/Controller.php(190): yii\base\Controller->runAction()
#14 /vendor/yiisoft/yii2/base/Module.php(534): craft\web\Controller->runAction()
#15 /vendor/craftcms/cms/src/web/Application.php(274): yii\base\Module->runAction()
#16 /vendor/craftcms/cms/src/web/Application.php(577): craft\web\Application->runAction()
#17 /vendor/craftcms/cms/src/web/Application.php(253): craft\web\Application->_processActionRequest()
#18 /vendor/yiisoft/yii2/base/Application.php(392): craft\web\Application->handleRequest()
#19 /public_html/index.php(22): yii\base\Application->run()
#20 {main}

We found this query task on our server. We think this was our little devil that crashed everything. (We manually killed it because there was no other option left.)

| 11450319 | DBNAME | localhost | DBNAME | Query | 0 | Updating | UPDATE structureelements SET rgt=rgt-2, dateUpdated='2021-10-25 13:08:37' WHERE (rgt >= 512) AND (root=123) | 0.000 |

Also these:

| 11469436 | DBNAME | localhost | DBNAME | Query | 1874 | Waiting for table metadata lock | UPDATE users SET lastLoginDate='2021-10-25 12:36:15', dateUpdated='2021-10-25 12:36:15' WHERE id=1 | 0.000 |
| 11469471 | DBNAME | localhost | DBNAME | Query | 1871 | Waiting for table metadata lock | UPDATE users SET lastLoginDate='2021-10-25 12:36:18', dateUpdated='2021-10-25 12:36:18' WHERE id=1 | 0.000 |
| 11466868 | DBNAME | localhost | DBNAME | Query | 2836 | Waiting for table metadata lock | LOCK TABLES assets
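For anyone in the same position, the blocking connection can be identified and killed with standard MySQL/MariaDB statements (the id passed to KILL is the first column of the process list rows above):

-- list current connections and what each one is waiting on
SHOW FULL PROCESSLIST;

-- show the latest detected deadlock and the transactions currently holding locks
SHOW ENGINE INNODB STATUS;

-- kill the offending connection by its process id
KILL 11450319;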

Maybe you guys can shed some light on this issue so we can prevent it in the future.

  • PHP version: 7.4.23
  • Craft CMS Pro: 3.6.11.2
  • MariaDB: 10.3.31

kevinmu17 avatar Oct 25 '21 14:10 kevinmu17

@kevinmu17 - This is the exact issue that forced us to use just one queue processor. We've been fine ever since.

sjcallender avatar Oct 25 '21 15:10 sjcallender

@kevinmu17 - This is the exact issue that forced us to use just one queue processor. We've been fine ever since.

We didn't install any, or does Craft use more than 1 out of the box? :)

kevinmu17 avatar Oct 25 '21 15:10 kevinmu17

We didn't install any, or does Craft use more than 1 out of the box? :)

By default, the queue is run when you visit the control panel. I'm not sure if two or more can ever be initiated this way. @brandonkelly?

I'd recommend setting 'runQueueAutomatically' => false in config/general.php and then enabling a single alternative queue worker as noted in the docs: https://craftcms.com/docs/3.x/config/config-settings.html#runqueueautomatically
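A minimal sketch of that combination (the paths and schedule below are only examples; adjust them for your install):

<?php
// config/general.php - stop the control panel from spawning queue runners on page loads
return [
    'runQueueAutomatically' => false,
];

Then run a single dedicated worker, for example one cron entry guarded by flock so two runs can never overlap:

* * * * * /usr/bin/flock -n /var/lock/craft-queue.lock /usr/bin/php /var/www/html/craft queue/run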

sjcallender avatar Oct 25 '21 16:10 sjcallender

@kevinmu17 I'd suggest updating to the latest 3.7 release as a starting point and seeing if the behavior changes. There have been improvements around transactions and locking that you're missing.

angrybrad avatar Oct 25 '21 16:10 angrybrad

Could https://github.com/craftcms/cms/issues/9977 be a potential fix for this?

Edit: unlikely — locally I've just run utils/repair/section-structure against my main "Page" structure and the number of structureelements rows has remained at around 55k.
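For anyone else wanting to try it, this is roughly how I ran it (pages is just an example handle; as far as I recall the command takes the section handle as its argument):

php craft utils/repair/section-structure pages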

nickdunn avatar Nov 03 '21 08:11 nickdunn

@nickdunn You updated to 3.7.19 first right?

brandonkelly avatar Nov 03 '21 23:11 brandonkelly

Ha, yes I did! The utils/repair/section-structure CLI task ran and output the tree structure of my entries.

But I'm thinking maybe 55K structureelements rows is par for the course for a few hundred entries. I have 3 Structure sections all totalling ~250 entries.

I must admit I don't fully understand what the lower-level structures and structureelements are, but I have other plugins that use things that are "ordered" (Neo blocks, Navigation, etc.), so maybe these add rows to the table and 55K rows is expected. There are 18K rows in structures, 8K of which have dateDeleted set.

nickdunn avatar Nov 04 '21 10:11 nickdunn

Ah… it’s due to Neo. Each individual Neo field/element combo is going to add its own additional structure.

brandonkelly avatar Nov 04 '21 14:11 brandonkelly

We just updated to 3.7.10.1 since we had similar problems with a version < 3.7.19, but the problem still persists, although now it no longer creates that massive DB load which could previously only be resolved by restarting the database.

Here are some facts:

  • It is related to "Pruning extra revisions" tasks, which started a day earlier and have made no progress at all.
  • The database is executing the same queries again and again:
SELECT *
FROM `structureelements`
WHERE ((`lft` > 44) AND (`rgt` < 45) AND (`level` <= 2)) AND (`root`=1919849)
ORDER BY `lft`

UPDATE `structureelements` SET `lft`=`lft`-2, `dateUpdated`='2022-01-28 08:05:27' WHERE (`lft` >= 46) AND (`root`=1919849)

DELETE FROM `structureelements` WHERE ((`lft` >= 44) AND (`rgt` <= 45)) AND (`root`=1919849)
  • We have failed "Pruning extra revisions" tasks with the following error:
SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction The SQL being executed was: DELETE FROM `structureelements` WHERE ((`lft` >= 41) AND (`rgt` <= 42)) AND (`root`=1919849)
  • I have 1 queue runner via cron job, using flock to prevent concurrent runs (runQueueAutomatically disabled). The queue runner shows the "prune" job hanging at 33% (no changes for 24 hours). Although this job is active/reserved, other tasks are still handled (primarily updating search indexes). My first guess would be that I have multiple runners, which I cannot confirm. I mentioned this in a support ticket before: the queue sometimes loses track of its running tasks and does not close them. It has something to do with database timeouts and long-running queue tasks.
  • Next I emptied the queue and made sure no queue runs were happening (ps ax grepping for the queue tasks: nothing). But the database queries mentioned above are still happening, again and again. I cannot stop them, so they must be self-triggering or something.

I know that some of my findings above sound rather confused, but this is what I am seeing.

I need to mention that we have Neo installed at version 2.8.16 and we need to update it. That will happen soon and I will keep you posted.

hiasl avatar Jan 28 '22 08:01 hiasl

@hiasl We have the exact same issue. We've tried rebooting the DB cluster and the issue still persists. Did you do anything to resolve it?

I don't know if this is the case for others, but we use the Scout plugin and I can reproduce "Could not acquire a lock" and "Unable to acquire a lock" issues by attempting something similar to #9999.

Because we have many relationships throughout one of our sites, save and deletion events can take a very long time. If I do the following:

  1. Find an entry commonly referenced in an Algolia index
  2. Make a change to that entry (but don't save)
  3. Make a draft to that same entry
  4. Make a different change to the draft
  5. Try to apply the draft and run the save event in step 2 at the same time

Two long-running save events (and deletion events, because we have maxRevisions => 5 in our config) will run for both the primary element and the application of the draft, taking a very long time. This effectively always results in a lock.

If this is the case for you, consider disabling indexRelations in your Scout config. This will prevent Scout from trying to fetch every entry across your site that is related to the one you updated in order to sync those changes to your Algolia index.
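Something like this in config/scout.php (a minimal sketch; keep the rest of your existing Scout settings as they are):

<?php
// config/scout.php - sketch only
return [
    // Stop Scout from re-indexing every element related to the one being saved;
    // only the saved element itself gets synced to Algolia.
    'indexRelations' => false,

    // ...the rest of your existing Scout settings stay unchanged.
];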

I've also submitted a discussion and pull request (studioespresso/craft-scout#248) about moving the Scout events into the queue, which should speed this process up significantly and hopefully let others keep using indexRelations while keeping things fast and avoiding possible locks.

aaronbushnell avatar Oct 23 '22 14:10 aaronbushnell

@metheeruetaichetcharoen-toast sorry, too long ago, I don't remember how it ended...

hiasl avatar Oct 23 '22 18:10 hiasl

@aaronbushnell DB locks from saving an entry at the same time should already be prevented, as both of those actions acquire a mutex lock for the entry, so it shouldn't be possible to save the same entry twice at the same time.

If that's not working as expected, my guess is that you're running Craft in a load-balanced environment and haven't configured Craft to share the same mutex source across the servers. See the Mutex Locks section of our KB article on configuring Craft for load-balanced environments.
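A minimal sketch of one way to share the mutex source, pointing the mutex component at the shared database so every web node uses the same lock source (see the KB article for the exact recommendation for your setup; I believe newer 3.7 releases already default to a database-backed mutex):

<?php
// config/app.php - sketch only; assumes the default 'db' connection is shared
// by all load-balanced web nodes
return [
    'components' => [
        'mutex' => [
            'class' => yii\mutex\MysqlMutex::class,
            'db' => 'db', // use the shared DB connection for lock bookkeeping
        ],
    ],
];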

brandonkelly avatar Oct 24 '22 12:10 brandonkelly