
When a browse runs for a longer time, services with a TTL of 60 are sort of flapping

Open hrzlgnm opened this issue 1 year ago • 12 comments

I have the following publishing script, which sets all TTLs to 60:

  • https://gitlab.com/hrzlgnm/m/-/blob/master/zerodings/publish3.py?ref_type=heads

Running the query example against it, the service seems to be sort of flapping.

This is what I can observe using `cargo run --example=query _test._tcp`:

At 232.17µs : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 52.022944ms : ServiceFound("_test._tcp.local.", "void-void-vm._test._tcp.local.")
At 52.243274ms: Resolved a new service: void-void-vm._test._tcp.local.
 host: void-void-vm.local.
 port: 4223
 Address: 192.168.178.84
At 1.002199044s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 3.00448952s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 7.013818528s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 15.018062584s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 31.023905263s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 63.026812518s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 127.034710079s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 187.103158981s : ServiceRemoved("_test._tcp.local.", "void-void-vm._test._tcp.local.")
At 255.039446937s : SearchStarted("_test._tcp.local. on addrs [192.168.30.128, 2003:e8:bf3e:3f00:9caf:6c11:f93d:47bf, fe80::4ae2:8b49:d234:7a45, fe80::f3fc:8e58:38e3:ff2b, 192.168.178.84]")
At 255.113361982s : ServiceFound("_test._tcp.local.", "void-void-vm._test._tcp.local.")
At 255.113391995s: Resolved a new service: void-void-vm._test._tcp.local.
 host: void-void-vm.local.
 port: 4223
 Address: 192.168.178.84
At 315.113645282s : ServiceRemoved("_test._tcp.local.", "void-void-vm._test._tcp.local.")

After 187 seconds, the service is removed, despite still being published. It is then found again after 255s and removed again after 315s.

hrzlgnm avatar May 11 '24 22:05 hrzlgnm

I cannot observe this remove-and-add behavior when I run `avahi-browse -pr _test._tcp`.

hrzlgnm avatar May 11 '24 22:05 hrzlgnm

Are there any other instances of the same service type on any of the other interfaces? Particularly in the network of 192.168.30.128?

Also, could you try turning on logging with `RUST_LOG=debug` when running the querier example, to get the logs before the `ServiceRemoved` event?

keepsimple1 avatar May 12 '24 01:05 keepsimple1

I was able to reproduce this issue. I think the root cause is that we are not refreshing PTR DNS records, so after a while the PTR records expire and we remove the services.

The reason the problem did not show up earlier is that we have a timer with exponential backoff (1s, 2s, 4s, ...), and those repeated queries kept updating the records. Once the gap between queries grows longer than the 60-second TTL, the PTR record expires without being refreshed.
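
For illustration, a minimal sketch (not the crate's actual code) of how the backoff gaps eventually outgrow a 60-second TTL; the query times correspond to the SearchStarted lines in the log above:

```rust
fn main() {
    let ttl_secs = 60u64;
    let mut query_at = 1u64; // browse query times: 1s, 3s, 7s, 15s, 31s, ...
    let mut interval = 2u64; // the repeat interval doubles: 2s, 4s, 8s, ...
    while query_at < 300 {
        if interval > ttl_secs {
            // Without a dedicated refresh query, a PTR record answered at
            // `query_at` with ttl=60 can expire before the next browse query.
            println!(
                "gap of {interval}s between the queries at {query_at}s and {}s exceeds the 60s TTL",
                query_at + interval
            );
        }
        query_at += interval;
        interval *= 2;
    }
}
```

With PR #217, dedicated refresh queries near the end of each record's TTL are meant to close these gaps.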

I've opened PR #217 to fix this issue. Please check if that helps. I will try to come up with some test cases later.

Thanks for reporting this issue!

keepsimple1 avatar May 12 '24 04:05 keepsimple1

After testing for a longer time, it seems the problem is not completely gone. I will debug it further tomorrow.

keepsimple1 avatar May 12 '24 05:05 keepsimple1

Are there any other instances of the same service type on any of the other interfaces? Particularly in the network of 192.168.30.128?

No, there is no other instance of the same service type on any of the other interfaces. I also cannot observe the issue when there are at least two services of the same type on the network.

Also, could you try turning on logging with `RUST_LOG=debug` when running the querier example, to get the logs before the `ServiceRemoved` event?

Are the logs still needed, now that you can reproduce the issue?

hrzlgnm avatar May 12 '24 10:05 hrzlgnm

I was able to reproduce this issue. I think the root cause is that we are not refreshing PTR DNS records, so after a while the PTR records expire and we remove the services.

The reason the problem did not show up earlier is that we have a timer with exponential backoff (1s, 2s, 4s, ...), and those repeated queries kept updating the records. Once the gap between queries grows longer than the 60-second TTL, the PTR record expires without being refreshed.

I've opened PR #217 to fix this issue. Please check if that helps. I will try to come up with some test cases later.

Thanks for reporting this issue!

I've checked the fix from #217 and cannot observe the issue anymore, having run the querier example at least long enough for the message `At 1023.057225766s : SearchStarted("_test._tcp.local...` to be logged.

hrzlgnm avatar May 12 '24 11:05 hrzlgnm

Additionally, I also tested stopping the publishing Python script and restarting it, which has an interesting effect:

After quitting and restarting publish3.py, the service is Resolved twice immediately.

At 8451.632279173s : ServiceFound("_test._tcp.local.", "void-void-vm._test._tcp.local.")
At 8451.632511101s: Resolved a new service: void-void-vm._test._tcp.local.
 host: void-void-vm.local.
 port: 4223
 Address: 192.168.178.84
At 8451.644626053s: Resolved a new service: void-void-vm._test._tcp.local.
 host: void-void-vm.local.
 port: 4223
 Address: 192.168.178.84

Here is a log with debug logging enabled showing this case, where publish3.py (with the TTLs reduced to 10) was restarted 3 times and stopped at the end: restarts-of-publish-log2.log

hrzlgnm avatar May 12 '24 13:05 hrzlgnm

From your log, it looks like this happens because an address record was received with `cache_flush = false` (ttl = 60), whereas the previous ones had `cache_flush = true` (ttl = 10). Such a record is considered "new" by our current code, because the `DnsEntry` struct compares as different when `cache_flush` changes: https://github.com/keepsimple1/mdns-sd/blob/5eb74b508d602ba2181029c6d6cb434d8dc540a5/src/dns_parser.rs#L64-L68
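
To illustrate the comparison behavior, a minimal sketch with a simplified stand-in for `DnsEntry` (the field types here are assumptions, not the crate's exact definition):

```rust
#[derive(Debug, PartialEq)]
struct DnsEntry {
    name: String,
    ty: u16,
    class: u16,
    cache_flush: bool,
}

fn main() {
    let cached = DnsEntry { name: "void-void-vm.local.".into(), ty: 1, class: 1, cache_flush: true };
    let incoming = DnsEntry { name: "void-void-vm.local.".into(), ty: 1, class: 1, cache_flush: false };
    // The same address record on the wire compares as a different entry,
    // so the cache treats it as "new" and the service is resolved again.
    assert_ne!(cached, incoming);
}
```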

Here are all the Resolved events:

$ grep Resolved restarts-of-publish-log2.log                      
At 8.277988737s: Resolved a new service: void-void-vm._test._tcp.local.
At 8.282031276s: Resolved a new service: void-void-vm._test._tcp.local.
At 13.989534257s: Resolved a new service: void-void-vm._test._tcp.local.
At 20.537252221s: Resolved a new service: void-void-vm._test._tcp.local.
At 20.544644141s: Resolved a new service: void-void-vm._test._tcp.local.
At 30.538414974s: Resolved a new service: void-void-vm._test._tcp.local.
At 178.756557312s: Resolved a new service: void-void-vm._test._tcp.local.
At 178.760520066s: Resolved a new service: void-void-vm._test._tcp.local.
At 182.658263946s: Resolved a new service: void-void-vm._test._tcp.local.

For every pair of back-to-back Resolved events, the second one seems to be caused by a "new" DnsAddress record. For example:

[2024-05-12T13:43:57.456Z DEBUG mdns_sd::dns_parser] DnsAddress { record: DnsRecord { entry: DnsEntry { name: "void-void-vm.local.", ty: 1, class: 1, cache_flush: false }, ttl: 60, created: 1715521437456, expires: 1715521497456, refresh: 1715521485456 }, address: 192.168.178.84 }
At 8.282031276s: Resolved a new service: void-void-vm._test._tcp.local.

Any ideas why some of the DnsAddress records have TTL 60 and cache_flush false?

As a solution, I think we could add logic so that if cache_flush is false, the record is added to the end of the Vec for the same record name.
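
Roughly along these lines, as a minimal sketch with a hypothetical `insert_addr` helper (not the crate's actual cache code):

```rust
use std::collections::HashMap;
use std::net::IpAddr;

#[derive(Debug)]
struct AddrRecord {
    cache_flush: bool,
    address: IpAddr,
}

// Records with cache_flush == true stay at the front of the per-name Vec,
// records with cache_flush == false are appended at the end.
fn insert_addr(cache: &mut HashMap<String, Vec<AddrRecord>>, name: &str, rec: AddrRecord) {
    let records = cache.entry(name.to_string()).or_default();
    if rec.cache_flush {
        records.insert(0, rec);
    } else {
        records.push(rec);
    }
}

fn main() {
    let mut cache = HashMap::new();
    let addr: IpAddr = "192.168.178.84".parse().unwrap();
    insert_addr(&mut cache, "void-void-vm.local.", AddrRecord { cache_flush: false, address: addr });
    insert_addr(&mut cache, "void-void-vm.local.", AddrRecord { cache_flush: true, address: addr });
    // The authoritative (cache-flush) record ends up first.
    println!("{:?}", cache["void-void-vm.local."]);
}
```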

Btw, I think this is a separate issue from the original one, so we can address it in a separate PR.

keepsimple1 avatar May 13 '24 04:05 keepsimple1

After testing for a longer time, it seems the problem is not completely gone. I will debug it further tomorrow.

I debugged it a bit more. It looks like I forgot to add new refresh timers when a record's refresh time is further updated to 85%, 90%, or 95% of the TTL. I updated the PR to fix that. My testing looks good. If you could also test the updated patch, that would be great!
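
The gist of the fix, as a minimal sketch with hypothetical names (not the crate's internals): each time the refresh point advances to the next percentage of the TTL, a matching wake-up timer has to be scheduled too.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

struct Daemon {
    // Wake-up times in milliseconds; Reverse turns the max-heap into a min-heap.
    timers: BinaryHeap<Reverse<u64>>,
}

impl Daemon {
    /// Bump a record's refresh point to the next percentage of its TTL
    /// (80% -> 85% -> 90% -> 95%) and register a timer for it.
    fn bump_refresh(&mut self, created: u64, ttl_millis: u64, attempt: usize) -> Option<u64> {
        let pct = [80u64, 85, 90, 95].get(attempt).copied()?;
        let refresh_at = created + ttl_millis * pct / 100;
        // Without pushing a new timer here, only the first (80%) refresh ever fires.
        self.timers.push(Reverse(refresh_at));
        Some(refresh_at)
    }
}

fn main() {
    let mut daemon = Daemon { timers: BinaryHeap::new() };
    // A record with ttl = 60s created at t = 0 gets refresh timers at 48s, 51s, 54s, 57s.
    for attempt in 0..4 {
        println!("{:?}", daemon.bump_refresh(0, 60_000, attempt));
    }
}
```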

keepsimple1 avatar May 13 '24 04:05 keepsimple1

After testing for a longer time, it seems the problem is not completely gone. I will debug it further tomorrow.

I debugged it a bit more. It looks like I forgot to add new refresh timers when a record's refresh time is further updated to 85%, 90%, or 95% of the TTL. I updated the PR to fix that. My testing looks good. If you could also test the updated patch, that would be great!

I tested the updated patch and it looks good.

hrzlgnm avatar May 13 '24 06:05 hrzlgnm

From your log, it looks like this happens because an address record was received with `cache_flush = false` (ttl = 60), whereas the previous ones had `cache_flush = true` (ttl = 10). Such a record is considered "new" by our current code, because the `DnsEntry` struct compares as different when `cache_flush` changes:

I've redone the test, and I can indeed observe the one with ttl=60 being sent, despite using a TTL of 10. I guess there was another mDNS responder like Avahi involved, which had seen the same service with a ttl=60 before. I'll investigate further later.

https://github.com/keepsimple1/mdns-sd/blob/5eb74b508d602ba2181029c6d6cb434d8dc540a5/src/dns_parser.rs#L64-L68

Here are all the Resolved events:

$ grep Resolved restarts-of-publish-log2.log                      
At 8.277988737s: Resolved a new service: void-void-vm._test._tcp.local.
At 8.282031276s: Resolved a new service: void-void-vm._test._tcp.local.
At 13.989534257s: Resolved a new service: void-void-vm._test._tcp.local.
At 20.537252221s: Resolved a new service: void-void-vm._test._tcp.local.
At 20.544644141s: Resolved a new service: void-void-vm._test._tcp.local.
At 30.538414974s: Resolved a new service: void-void-vm._test._tcp.local.
At 178.756557312s: Resolved a new service: void-void-vm._test._tcp.local.
At 178.760520066s: Resolved a new service: void-void-vm._test._tcp.local.
At 182.658263946s: Resolved a new service: void-void-vm._test._tcp.local.

For every pair of back-to-back Resolved events, the second one seems to be caused by a "new" DnsAddress record. For example:

[2024-05-12T13:43:57.456Z DEBUG mdns_sd::dns_parser] DnsAddress { record: DnsRecord { entry: DnsEntry { name: "void-void-vm.local.", ty: 1, class: 1, cache_flush: false }, ttl: 60, created: 1715521437456, expires: 1715521497456, refresh: 1715521485456 }, address: 192.168.178.84 }
At 8.282031276s: Resolved a new service: void-void-vm._test._tcp.local.

Any ideas why some of the DnsAddress records have TTL 60 and cache_flush false?

I've redone the test, and I can indeed observe the one with ttl=60 being sent, despite using a TTL of 10. I guess there was another mDNS responder like Avahi involved, which had seen the same service before and cached the entry for another interface with a ttl=60. I'll investigate further later.

As a solution, I think we could add logic so that if cache_flush is false, the record is added to the end of the Vec for the same record name.

Btw, I think this is a separate issue from the original one, so we can address it in a separate PR.

Yes, let's move it to another issue. I'll open a separate one after investigating further why this is happening in my setup.

hrzlgnm avatar May 13 '24 06:05 hrzlgnm

Yes, let's move it to another issue. I'll open a separate one after investigating further why this is happening in my setup.

I've checked this again, and it was just incorrect usage of the zeroconf Python API on my side: if one wants to limit publishing to only one interface, one has to pass that info to the Zeroconf() constructor. See the updated publish4.py, which does not expose the "weird behavior". So there is no real issue, just wrong API usage.

hrzlgnm avatar May 13 '24 19:05 hrzlgnm

PR #217 merged. Thank you for testing diligently!

keepsimple1 avatar May 14 '24 04:05 keepsimple1