DNS TTL Not Respected by celestia-node Leading to Sync Issues
Description:
We encountered an issue where changes to the DNS entries of DA nodes in Arabica caused light nodes to fail to sync. Restarting the light nodes resolved the issue, indicating that they resolve DNS once at startup and then use the same IP address indefinitely, ignoring DNS TTL.
Steps to Reproduce:
- Change the DNS entries for nodes.
- Observe light nodes failing to sync.
- Restart the light nodes and observe they can sync.
Suspected Cause: Light nodes resolve DNS entries only once at startup and continue using the same IP address without respecting the TTL. This affects both:
- DA nodes connecting to other DA nodes.
- DA Bridge nodes connecting to consensus nodes.
Relevant Code:
- multiaddr DNS resolution: I could not find the relevant code.
- `--core.ip` DNS resolution: https://github.com/celestiaorg/celestia-node/blob/f98d632818d566c7b4fd995b0f4bdc6443a7ed06/nodebuilder/core/config.go#L47
Potential Fix:
- Periodically re-resolve DNS entries based on the TTL.
- Update active connections if the resolved IP address changes.
Repositories Potentially Needing Changes:
Impact: Not respecting DNS TTL can lead to connectivity and sync issues, affecting network reliability.
Request for Assistance:
- Identify where DNS resolution is handled in the codebase and dependencies.
- Implement periodic DNS resolution based on TTL.
- Test changes to ensure nodes dynamically update connections based on DNS updates.
I think the simplest fix here is to remove https://github.com/celestiaorg/celestia-node/blob/main/libs/utils/address.go#L40, which resolves the IP a single time on start, and instead let clients use the domain as passed in, relying on the infra of the internet to work,
unless there was a good reason we HAD to resolve the IP?
If I understand the code correctly, this would only resolve one part of the issue: the connection between the DA BN and the consensus node.
From my understanding, this code is not called when resolving the DNS in a multiaddr.
One workaround for this issue would be to recreate the connection once it fails after the IP address changes. This way, we don't need to add support to handle the DNS TTL, and the node would request the new IP address from the DNS server.
What's the status on this, @ramin @smuu?
Fixed in #3624