nsd
nsd copied to clipboard
IXFR serial quirks when NSD restarts and only have old zone file on disk
Setup: NSD 4.6 (compiled with --disable-radix-tree and --enable-packed).
server:
# the database to use
# database: "/dns/nsd/nsd.db"
database: ""
# Number of NSD servers to fork.
server-count: 1
# Verbosity level.
verbosity: 2
# The directory for zonefile: files.
zonesdir: "/var/lib/nsd/"
zone:
name: example.net
zonefile: %s
allow-axfr-fallback: yes
allow-notify: 136.244.78.244 NOKEY
request-xfr: 136.244.78.244 NOKEY
max-refresh-time: 300
NSD is Secondary. Serial on Primary is 1, Serial on NSD is 1.
root@klaus-nsd-slave:~# nsd-control zonestatus
zone: example.net
state: ok
served-serial: "1 since 2022-08-17T20:00:26"
commit-serial: "1 since 2022-08-17T20:00:26"
wait: "292 sec between attempts"
Then, write the zonefile:
# nsd-control write
Now, upate the zone on the Primary to serial=2. NSD will IXFR the new zone, but does not write it to disk:
Aug 17 20:05:43 klaus-nsd-slave nsd[15797]: xfrd: zone example.net committed "received update to serial 2 at 2022-08-17T20:05:43 from 136.244.78.244"
Aug 17 20:05:43 klaus-nsd-slave nsd[15810]: zone example.net. received update to serial 2 at 2022-08-17T20:05:43 from 136.244.78.244 of 235 bytes in 3.8e-05 seconds
Aug 17 20:05:43 klaus-nsd-slave nsd[15797]: zone example.net serial 1 is updated to 2
# nsd-control zonestatus
zone: example.net
state: ok
served-serial: "2 since 2022-08-17T20:05:43"
commit-serial: "2 since 2022-08-17T20:05:43"
wait: "281 sec between attempts"
Now restart NSD. NSD will serve the old version with serial=1, as this is the latest version on disk.
Aug 17 20:07:33 klaus-nsd-slave nsd[15861]: nsd started (NSD 4.6.0), pid 15845
Aug 17 20:07:33 klaus-nsd-slave nsd[15845]: zone example.net serial 2 is updated to 1
# nsd-control zonestatus
zone: example.net
state: ok
served-serial: "1 since 2022-08-17T20:07:33"
commit-serial: "2 since 2022-08-17T20:07:33"
wait: "275 sec between attempts"
So, now NSD serves serial=1, but when checking against the Primary, NSD uses serial=2. NSD should use serial=1, as serial=1 is the version that NSD has on disk. (btw: is there somewhere a description what the commit-serial is?)
# nsd-control transfer example.net
--> IXFR request decoded by tcpdump:
IXFR? example.net. ns: example.net. [30s] SOA secdev1-tst.rcode0.net. rcodezero-soa.ipcom.at. 2 10800 3600 604800 30 (120)
^
|
Now, when the serial is increased to 3, NSD requests IXFR with serial=2. Hence, NSD receives the differences from 2-to-3, and then detects that the DIFF can not be applied to the local existing zone with serial=1:
root@klaus-nsd-slave:~# nsd-control zonestatus
zone: example.net
state: ok
served-serial: "1 since 2022-08-17T20:07:33"
commit-serial: "2 since 2022-08-17T20:15:03"
wait: "12998 sec between attempts"
Aug 17 20:20:58 klaus-nsd-slave nsd[15862]: notify for example.net. from 136.244.78.244 serial 3
Aug 17 20:20:58 klaus-nsd-slave nsd[15845]: xfrd: zone example.net committed "received update to serial 3 at 2022-08-17T20:20:58 from 136.244.78.244"
Aug 17 20:20:58 klaus-nsd-slave nsd[15845]: xfrd: zone example.net: soa serial 3 update failed, restarting transfer (notified zone)
root@klaus-nsd-slave:~# nsd-control zonestatus
zone: example.net
state: refreshing
served-serial: "1 since 2022-08-17T20:07:33"
commit-serial: "1 since 2022-08-17T20:07:33"
wait: "26537 sec between attempts"
Proof, that NSD requests IXFR with serial=2, although it only has serial=1 available local:
Aug 17 20:20:58 klaus-bind-master named[17796]: client @0x7eff1c032790 78.141.197.185#35992 (example.net): transfer of 'example.net/IN': IXFR started (serial 2 -> 3)
Now, as NSD "falled back" to served-serial=commit-serial=1, it can recover on the next check:
root@klaus-nsd-slave:~# nsd-control transfer example.net
root@klaus-nsd-slave:~# Aug 17 20:21:23 klaus-nsd-slave nsd[15845]: xfrd: zone example.net committed "received update to serial 3 at 2022-08-17T20:21:23 from 136.244.78.244"
Aug 17 20:21:23 klaus-nsd-slave nsd[15861]: zone example.net. received update to serial 3 at 2022-08-17T20:21:23 from 136.244.78.244 of 325 bytes in 4.8e-05 seconds
Aug 17 20:21:23 klaus-nsd-slave nsd[15845]: zone example.net serial 1 is updated to 3
# nsd-control zonestatus
zone: example.net
state: ok
served-serial: "3 since 2022-08-17T20:21:23"
commit-serial: "3 since 2022-08-17T20:21:23"
wait: "285 sec between attempts"
So. When NSD restarts and the served serial goes backwards due to "only old zone on disk", also the "committed-serial" (what is the commited serial?) should be lowered to the served serial, so that IXFR requests use the proper serial.
This is a very good analysis @klaus-nicat! I think the commit-serial
is from the database days. NSD should not update its value until it has saved the zone to disk in some way. So this certainly looks like a bug.
Indeed excellent analysis @klaus-nicat !
So, now NSD serves serial=1, but when checking against the Primary, NSD uses serial=2. NSD should use serial=1, as serial=1 is the version that NSD has on disk. (btw: is there somewhere a description what the commit-serial is?)
commit-serial is the serial received from transfer and stored on disk in a transfer file to be processed. Sending that along in the IXFR is indeed a bug! Thanks for finding that!
@wtoorop We have a zone which gets updates every few seconds, and we regularly also see "update failed". It seems that happens when multiple IXFRs are applied - see log below. The weird thing is that NSD activates the new zone and then the "update failed" is about on old serial of this zone. Can it be, that this is caused by the same bug? If not, I would open another issue.
13:11:04 nsd[2935091]: zone eu. received update to serial 1083742837 at 2022-08-18T13:11:04 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4331 bytes in 0.000121 seconds
13:11:04 nsd[2935090]: zone eu serial 1083742836 is updated to 1083742837
13:11:09 nsd[341245]: notify for eu. from 83.136.34.4 serial 1083742840
13:11:09 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742840 at 2022-08-18T13:11:09 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:09 nsd[2935091]: zone eu. received update to serial 1083742840 at 2022-08-18T13:11:09 from 83.136.34.4 TSIG verified with key rcode0-distribution of 9261 bytes in 0.000124 seconds
13:11:09 nsd[2935090]: zone eu serial 1083742837 is updated to 1083742840
13:11:19 nsd[341320]: notify for eu. from 83.136.34.4 serial 1083742841
13:11:19 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742841 at 2022-08-18T13:11:19 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:19 nsd[2935091]: zone eu. received update to serial 1083742841 at 2022-08-18T13:11:19 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4386 bytes in 0.000106 seconds
13:11:19 nsd[2935090]: zone eu serial 1083742840 is updated to 1083742841
13:11:20 nsd[2935091]: writing zone eu to file eu.zone
13:11:24 nsd[341655]: notify for eu. from 2a02:850:8::5 serial 1083742842
13:11:24 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742842 at 2022-08-18T13:11:24 from 2a02:850:8::5 TSIG verified with key rcode0-distribution"
13:11:29 nsd[341655]: notify for eu. from 83.136.34.4 serial 1083742843
13:11:30 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742843 at 2022-08-18T13:11:29 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:34 nsd[2935091]: zone eu. received update to serial 1083742842 at 2022-08-18T13:11:24 from 2a02:850:8::5 TSIG verified with key rcode0-distribution of 4390 bytes in 5.2e-05 seconds
13:11:34 nsd[2935091]: zone eu. received update to serial 1083742843 at 2022-08-18T13:11:29 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4306 bytes in 5.6e-05 seconds
13:11:35 nsd[2935090]: zone eu serial 1083742841 is updated to 1083742843
==== HERE UPDATE FAILED =====
13:11:35 nsd[2935090]: xfrd: zone eu: soa serial 1083742842 update failed, restarting transfer (notified zone)
==== HERE UPDATE FAILED =====
13:11:36 nsd[342099]: notify for eu. from 83.136.34.4 serial 1083742844
13:11:37 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742844 at 2022-08-18T13:11:37 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:37 nsd[2935091]: zone eu. received update to serial 1083742844 at 2022-08-18T13:11:37 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4268 bytes in 0.000115 seconds
13:11:37 nsd[2935090]: zone eu serial 1083742843 is updated to 1083742844
13:11:41 nsd[342113]: notify for eu. from 83.136.34.4 serial 1083742845
13:11:41 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742845 at 2022-08-18T13:11:41 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:42 nsd[2935091]: zone eu. received update to serial 1083742845 at 2022-08-18T13:11:41 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4340 bytes in 0.000135 seconds
13:11:42 nsd[2935090]: zone eu serial 1083742844 is updated to 1083742845
Your suggested fix does not work. Starting with serial=1, then write, then increase serial=2, then NSD restart:
a) after NSD restart, now NSD requests IXFR with serial 1 (the last written to disk, that is good), but the served zone is not updated. NSD sill serves serial=1 b) when updating the serial=3, NSD requests IXFR with serial=1, but then fails to apply the IXFR:
Aug 18 14:52:17 klaus-nsd-slave nsd[18051]: notify for example.net. from 136.244.78.244 serial 3
Aug 18 14:52:17 klaus-nsd-slave nsd[18037]: xfrd: zone example.net bad transfer 3 from 136.244.78.244
Can it be, that this is caused by the same bug? If not, I would open another issue.
It does look related. Your situation at 13:11:24 would be a reason to send along the committed serial (because the served serial is not yet 1083742842). So let's try to take it along in this issue for the moment :)
b) when updating the serial=3, NSD requests IXFR with serial=1, but then fails to apply the IXFR:
That's even worse! I'll change the title of the PR! Thanks for trying it out immediately.
I guess current lazy write to disk is not possible with IXFR and zone must be written to disk immediately. OR there need to be two copies of data for IXFR - one on disk, written when zone is written to disk and other version which is in ram and reflects zone status in ram.
Hi! Any news? Can we help getting this fixed? Thanks
Not yet. Sorry @klaus-nicat . @k0ekk0ek do you think you can make a bit of time to have a look?
@wtoorop, @klaus-nicat, I'll have a look.
@klaus-nicat, I've picked this one up. I'm creating a small reproducer first for easy validation and to avoid regressions in the future and then implement a fix. I'll report back ASAP.
@klaus-nicat, I've added a test to verify correct behavior in the future (so it currently fails :sweat_smile:). I'll start working on the actual fix tomorrow.
That's a smart idea ;-) Thanks