pdns icon indicating copy to clipboard operation
pdns copied to clipboard

ifportup/ifurlup unexpected behavior

Open sfoutrel opened this issue 2 years ago • 7 comments

  • Program: Authoritative
  • Issue type: Bug report

Short description

ifurlup/ifportup does not use checkresults to answer on first resolv.

Environment

  • Operating system: Debian 11
  • Software version: 4.7.3
  • Software source: deb [arch=amd64] http://repo.powerdns.com/debian bullseye-auth-47 main

Steps to reproduce

  1. add records to existing or new zone
tcp.domain-client.tld.fdns.failoverdns.test     0       IN      LUA     A "ifportup(80, {'100.64.43.123', '100.64.43.124','100.64.43.126'})"
www.domain-client.tld.fdns.failoverdns.test     0       IN      LUA     A "ifurlup('http://www.domain-client.tld/test/', {{'100.64.43.123', '100.64.43.124'},{'100.64.43.126'}})"
  1. add parameters to pdns.conf then restart server expire-delay and check-interval are low value to reproduce multiple times and analyze easily.
lua-health-checks-expire-delay=10
lua-health-checks-interval=2
log-dns-details
log-dns-queries
log-timestamp
loglevel=9
  1. use dig to try first resolv.
dig tcp.domain-client.tld.fdns.failoverdns.test @localhost

; <<>> DiG 9.16.37-Debian <<>> tcp.domain-client.tld.fdns.failoverdns.test @localhost
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52470
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;tcp.domain-client.tld.fdns.failoverdns.test. IN	A

;; ANSWER SECTION:
tcp.domain-client.tld.fdns.failoverdns.test. 0 IN A 100.64.43.124

;; Query time: 4 msec
;; SERVER: ::1#53(::1)
;; WHEN: Thu Feb 16 12:44:42 CET 2023
;; MSG SIZE  rcvd: 88

corresponding logs

Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Remote ::1 wants 'tcp.domain-client.tld.fdns.failoverdns.test|A', do = 0, bufsize = 1232 (4096): packetcache MISS
Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Lua record monitoring declaring TCP/IP 100.64.43.126:80 DOWN: connecting to 100.64.43.126:80 failed: Connection refused
Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Lua record monitoring declaring TCP/IP 100.64.43.123:80 UP!
Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Lua record monitoring declaring TCP/IP 100.64.43.124:80 DOWN: connecting to 100.64.43.124:80 failed: Connection refused
  1. a new call to the record gives the UP ip in answer as expected. let pass 10s and try again and you can have a DOWN ip as answer.
  2. with ifurlup you can have the failback ip as first resolv even when one of the first ip is up.

Expected behaviour

using the first check results as answer.

Actual behaviour

seems to answer one of the ips at random without check

Other information

sfoutrel avatar Feb 16 '23 11:02 sfoutrel

This looks like a duplicate of #9529, or am I missing something?

Habbie avatar Feb 16 '23 11:02 Habbie

I just saw your answer on #9529. I though the issue was different (He asked if healthckeck could be done when the LUA test was added).

So on first resolv we can have a wrong answer and each time lua-health-checks-expire-delay value is reached. is it a bug or by design ?

sfoutrel avatar Feb 16 '23 13:02 sfoutrel

is it a bug or by design ?

Like #9529, this currently is by design.

Habbie avatar Feb 16 '23 14:02 Habbie

As I don't think we want to change the design, maybe somethingg like https://github.com/PowerDNS/pdns/pull/12558 could help controling the default behavior ?

chbruyand avatar Feb 16 '23 14:02 chbruyand

SPOILER: I'm not a C dev (at last a junior Python enthusiast). @chbruyand Correct me if I read it badly but your proposal would only push the ips in the "candidates" object instead of "unavailables". I don't understand why would that change anything. @Habbie I read all the code and from my understanding you do not wait for the availability of the results of the async checks to answer (to avoid deadlocks ?). I found another issue (which maybe not be one), performance. Does the lua functions use the cache ? I tried setting the cache-ttl to different values but I always get about 5000 req/s instead of 130K req/s with standard records with sql backend.

sfoutrel avatar Feb 20 '23 08:02 sfoutrel

@sfoutrel The first call to if{port,url}up() will always result in a negative return as checks are performed asynchronously and we use that first call to register the port/url. To rephrase, we consider a port/url to be down by default unless a check has been done (asynchronously). My purposal is to have an option to control that default behavior.

chbruyand avatar Feb 20 '23 09:02 chbruyand

I don't get it. Actual version does already that. It consider every port/url to be down so it replies with any ip of the set (according to selector). From my understanding, your proposal is interesting only for ifurlup if we use a failover set of ips. It would answer of the "primary ips" instead of one of all ips (primary and failover).

sfoutrel avatar Feb 20 '23 09:02 sfoutrel