ifportup/ifurlup unexpected behavior
- Program: Authoritative
- Issue type: Bug report
Short description
ifurlup/ifportup does not use checkresults to answer on first resolv.
Environment
- Operating system: Debian 11
- Software version: 4.7.3
- Software source: deb [arch=amd64] http://repo.powerdns.com/debian bullseye-auth-47 main
Steps to reproduce
- add records to existing or new zone
tcp.domain-client.tld.fdns.failoverdns.test 0 IN LUA A "ifportup(80, {'100.64.43.123', '100.64.43.124','100.64.43.126'})"
www.domain-client.tld.fdns.failoverdns.test 0 IN LUA A "ifurlup('http://www.domain-client.tld/test/', {{'100.64.43.123', '100.64.43.124'},{'100.64.43.126'}})"
- add parameters to pdns.conf then restart server expire-delay and check-interval are low value to reproduce multiple times and analyze easily.
lua-health-checks-expire-delay=10
lua-health-checks-interval=2
log-dns-details
log-dns-queries
log-timestamp
loglevel=9
- use dig to try first resolv.
dig tcp.domain-client.tld.fdns.failoverdns.test @localhost
; <<>> DiG 9.16.37-Debian <<>> tcp.domain-client.tld.fdns.failoverdns.test @localhost
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 52470
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;tcp.domain-client.tld.fdns.failoverdns.test. IN A
;; ANSWER SECTION:
tcp.domain-client.tld.fdns.failoverdns.test. 0 IN A 100.64.43.124
;; Query time: 4 msec
;; SERVER: ::1#53(::1)
;; WHEN: Thu Feb 16 12:44:42 CET 2023
;; MSG SIZE rcvd: 88
corresponding logs
Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Remote ::1 wants 'tcp.domain-client.tld.fdns.failoverdns.test|A', do = 0, bufsize = 1232 (4096): packetcache MISS
Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Lua record monitoring declaring TCP/IP 100.64.43.126:80 DOWN: connecting to 100.64.43.126:80 failed: Connection refused
Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Lua record monitoring declaring TCP/IP 100.64.43.123:80 UP!
Feb 16 12:44:42 rd-psoa-01 pdns_server[2281479]: Lua record monitoring declaring TCP/IP 100.64.43.124:80 DOWN: connecting to 100.64.43.124:80 failed: Connection refused
- a new call to the record gives the UP ip in answer as expected. let pass 10s and try again and you can have a DOWN ip as answer.
- with ifurlup you can have the failback ip as first resolv even when one of the first ip is up.
Expected behaviour
using the first check results as answer.
Actual behaviour
seems to answer one of the ips at random without check
Other information
This looks like a duplicate of #9529, or am I missing something?
I just saw your answer on #9529. I though the issue was different (He asked if healthckeck could be done when the LUA test was added).
So on first resolv we can have a wrong answer and each time lua-health-checks-expire-delay value is reached. is it a bug or by design ?
is it a bug or by design ?
Like #9529, this currently is by design.
As I don't think we want to change the design, maybe somethingg like https://github.com/PowerDNS/pdns/pull/12558 could help controling the default behavior ?
SPOILER: I'm not a C dev (at last a junior Python enthusiast). @chbruyand Correct me if I read it badly but your proposal would only push the ips in the "candidates" object instead of "unavailables". I don't understand why would that change anything. @Habbie I read all the code and from my understanding you do not wait for the availability of the results of the async checks to answer (to avoid deadlocks ?). I found another issue (which maybe not be one), performance. Does the lua functions use the cache ? I tried setting the cache-ttl to different values but I always get about 5000 req/s instead of 130K req/s with standard records with sql backend.
@sfoutrel The first call to if{port,url}up() will always result in a negative return as checks are performed asynchronously and we use that first call to register the port/url. To rephrase, we consider a port/url to be down by default unless a check has been done (asynchronously). My purposal is to have an option to control that default behavior.
I don't get it. Actual version does already that. It consider every port/url to be down so it replies with any ip of the set (according to selector). From my understanding, your proposal is interesting only for ifurlup if we use a failover set of ips. It would answer of the "primary ips" instead of one of all ips (primary and failover).