pdns icon indicating copy to clipboard operation
pdns copied to clipboard

API: gmysql Deadlock

Open Bodenhaltung opened this issue 5 years ago • 22 comments

  • Program: Authoritative
  • Issue type: Bug report

Short description

API Updates result in a MySQL Deadlock

Environment

  • Operating system: ArchLinux
  • Software version: pdns-4.2.1-1
  • Software source: Operating system repository

Steps to reproduce

  1. Send a few IP changes to the api.

Expected behaviour

No Errors.

Actual behaviour

MySQL Error:

[webserver] 50746175-0898-4b7c-812a-b8244a6c0a81 HTTP ISE for "/api/v1/servers/localhost/zones/domain.tld": Exception: GSQLBackend unable to nullify ordername and update auth for domain.tld for domain_id 1: Could not execute mysql statement: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0: Deadlock found when trying to get lock; try restarting transaction

Version 4.2.0-7 works still fine.

MySQL 8.0.16-7 Percona.

Record Table:

| records | CREATE TABLE `records` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `domain_id` int(11) DEFAULT NULL,
  `name` varchar(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL,
  `type` varchar(10) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL,
  `content` varchar(64000) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL,
  `ttl` int(11) DEFAULT NULL,
  `prio` int(11) DEFAULT NULL,
  `disabled` tinyint(1) DEFAULT '0',
  `change_date` int(11) DEFAULT NULL,
  `ordername` varchar(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL,
  `auth` tinyint(1) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `rec_name_index` (`name`),
  KEY `nametype_index` (`name`,`type`),
  KEY `domain_id` (`domain_id`),
  KEY `orderindex` (`ordername`)
) ENGINE=InnoDB AUTO_INCREMENT=9601267 DEFAULT CHARSET=utf8 |

Bodenhaltung avatar Dec 31 '19 20:12 Bodenhaltung

Can you please provide logs when running with these settings:

webserver-loglevel=detailed
loglevel=9
query-logging=yes

pieterlexis avatar Jan 06 '20 10:01 pieterlexis

Sure, but it is a lot (16 minutes = 23MB) so i would suggest to send a download link via email?

Bodenhaltung avatar Jan 08 '20 20:01 Bodenhaltung

I get sometimes a 500er for a PATCH against the api, but no deadlock since a few debug hours, every api PATCH request needs 8 seconds.

Bodenhaltung avatar Jan 13 '20 21:01 Bodenhaltung

And i see a lot of queries for the same id:

journalctl -u pdns -g "Query 140028215011296: update" |wc -l
4988
journalctl -u pdns -g "Query 140028214799328: update" |wc -l
4966

Example:

Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: 257 usec to execute
Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: 235 usec to execute
Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: 234 usec to execute
Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: 259 usec to execute
Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:49:50 pdns_server[629766]: Query 140028215011296: 229 usec to execute
[...]
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
Jan 13 22:48:50 pdns_server[629766]: Query 140028214799328: update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0

Bodenhaltung avatar Jan 13 '20 22:01 Bodenhaltung

Can i help to debug this issue?

Bodenhaltung avatar Apr 08 '20 13:04 Bodenhaltung

I have no reason to believe this will help, but can you give 4.3.0 a shot?

Habbie avatar Apr 09 '20 13:04 Habbie

I'm not sure if it is related but, the "KEY rec_name_index (name)" is not part of the original schema and is not necessary since there is also a "KEY nametype_index (name,type)" index.

mind04 avatar Apr 09 '20 13:04 mind04

I have no reason to believe this will help, but can you give 4.3.0 a shot?

I will try but i have to wait for the Arch Repo is updated.

I'm not sure if it is related but, the "KEY rec_name_index (name)" is not part of the original schema and is not necessary since there is also a "KEY nametype_index (name,type)" index.

Thanks for notice this, i have deleted the key but doesn't help but also it does no harm. ;-)

Bodenhaltung avatar Apr 18 '20 21:04 Bodenhaltung

I had the same issue previously when updating to 4.2.1. It was related to Empty Non-Terminals and PowerDNS creating them as record entry in the database when starting on 4.2.1. After some help in the powerdns user mailing list I set default-api-rectify to no which fixed the issue for me.

aft2d avatar Apr 25 '20 01:04 aft2d

Note to ourselves (developers): #8497, backported in 4.2.1 in #8503, causes a mix of di.backend and dk(&B), which is bad, to run on patchZone where it did not before. Also, it is missing from the changelog.

Habbie avatar Apr 25 '20 07:04 Habbie

Thank you @aft2d i set "default-api-rectify=no" and the new version runs without any problems.

Bodenhaltung avatar Apr 25 '20 14:04 Bodenhaltung

This bug has not been fixed.

Habbie avatar Apr 25 '20 19:04 Habbie

My guess is that as every API request updates every ordername in the domain (cf #9064), all records of this domain will be locked, and as a result, it can deadlock often. (Unless the db server actually looks at the result of the transaction, which is equal on most records anyways -- but I would assume this kind of "merge" does not happen.)

nils-wisiol avatar Apr 30 '20 06:04 nils-wisiol

My guess is the same - I need to check if my comment from 5 days ago means we are indeed doing PATCH and rectify in overlapping transactions, where it should just be one transaction together.

Habbie avatar Apr 30 '20 06:04 Habbie

I took a high-level look at the commits there -- I don't think it's related, as the behavior before and after was that all ordernames will be updated on each rectify. (My feeling tells me that @Bodenhaltung is using a large zone.)

nils-wisiol avatar Apr 30 '20 06:04 nils-wisiol

(My feeling tells me that @Bodenhaltung is using a large zone.)

Jepp, but not sure what is "large", in this zone are 32702 records.

Bodenhaltung avatar Apr 30 '20 07:04 Bodenhaltung

I took a high-level look at the commits there -- I don't think it's related, as the behavior before and after was that all ordernames will be updated on each rectify. (My feeling tells me that @Bodenhaltung is using a large zone.)

The difference is that before those two PRs, the rectify code would not run at all for many people. The actual problem is not in those two PRs though :)

Habbie avatar Apr 30 '20 07:04 Habbie

My guess is the same - I need to check if my comment from 5 days ago means we are indeed doing PATCH and rectify in overlapping transactions, where it should just be one transaction together.

I believe the rectify would produce invalid data then, because the rectify TX wouldn't see the newly inserted RRsets (?).

zeha avatar May 01 '20 23:05 zeha

Is it still save to run with "default-api-rectify=no"?

Bodenhaltung avatar Nov 29 '20 11:11 Bodenhaltung

Hello, currently running pdns-auth 4.4.1 from http://repo.powerdns.com/debian buster-auth-44 I wanted to use dnssec so I set default-api-rectify=yes (according to some readings about pdns and dnssec). It took about 90 minutes to insert about 8000 records. And I saw many errors about deadlocks on update that leads me to this issue.

If I set default-api-rectify=no It took 20 seconds !

I don't exactly understand what api-rectify does as it seems (to me) unclear in the docs. Is there any other explanation on the role of that config line ?

by the way: the version naming of .deb package is not good according to debian upgrade policies. 4.4.1-1pdns.buster is considered a downgrade of 4.4.1-1pdns.stretch according to the fact that buster is before stretch if sorted alphabetically. Correct way should be something like : deb9, deb10

sfoutrel avatar Jun 30 '21 15:06 sfoutrel

I realized i didn't paste the deadlock error here.

The MySQL Version is actual 10.5.11-MariaDB.

LATEST DETECTED DEADLOCK                                                                                                                                                                                                                                              [411/4597]
------------------------
2021-07-05 15:54:13 0x7f9778c76640
*** (1) TRANSACTION:
TRANSACTION 2014900, ACTIVE 4 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 2
MySQL thread id 58321, OS thread handle 140288207570496, query id 23222070 localhost pdns Updating
update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 12 page no 471 n bits 496 index nametype_index of table `pdns`.`records` trx id 2014900 lock_mode X locks rec but not gap waiting
Record lock, heap no 159 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 10; hex 646e73686f6d652e6465; asc REDACTED.de;;
 1: len 1; hex 41; asc A;;
 2: len 8; hex 800000000000005c; asc        \;;

And:

*** (2) TRANSACTION:
TRANSACTION 2014899, ACTIVE 6 sec starting index read
mysql tables in use 1, locked 1
362 lock struct(s), heap size 57464, 19989 row lock(s), undo log entries 2
MySQL thread id 58320, OS thread handle 140288543123008, query id 23233657 localhost pdns Updating
update records set ordername=NULL,auth=? where domain_id=? and name=? and disabled=0
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 12 page no 471 n bits 496 index nametype_index of table `pdns`.`records` trx id 2014899 lock_mode X locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 18; hex 646a6b726963652e646e73686f6d652e6465; asc subdom1.redacted.de;;
 1: len 1; hex 41; asc A;;
 2: len 8; hex 800000000021fa07; asc      !  ;;

Record lock, heap no 15 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 17; hex 646a72696d612e646e73686f6d652e6465; asc subdom2.redacted.de;;
 1: len 1; hex 41; asc A;;
 2: len 8; hex 80000000000435cf; asc       5 ;;

Record lock, heap no 16 PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 17; hex 646a72696d612e646e73686f6d652e6465; asc subdom3.redacted.de;;
 1: len 4; hex 41414141; asc AAAA;;
 2: len 8; hex 8000000000046d76; asc       mv;;
[...] 

Bodenhaltung avatar Jul 05 '21 14:07 Bodenhaltung

I just stumbled across CommunicatorClass::d_data and d_inprogress, which act as a per-domain mutex for secondary operation. Wondering now if the API should also participate.

zeha avatar Sep 23 '24 07:09 zeha

This problem still occurs for our environment, we are using PowerDNS authoritative version 4.7.4 (will be upgraded soon) and mariaDB version 10.11.2 both services are running seperetly on RHEL 8.4 VMs. For maria we are using a 9 node galera cluster across 3 different DCs.

We were able to reproduce the issue by sending two API queries to the same rrset (in one zone)

For example sending api request to the domain: dns.dev-project.com

Example rrsets:

api.dns.dev-project.com IN A 192.168.172.1
api.dns.dev-project.com IN A 192.168.172.2

Are there any fixes to overcome this issue, when using ACME and creating the challenges we run into this alot. @Habbie

GuyT2002 avatar Jan 06 '25 15:01 GuyT2002