DNS TTL Not Respected by celestia-node Leading to Sync Issues

Open smuu opened this issue 1 year ago • 4 comments

Description:

We encountered an issue where changes to the DNS entries of DA nodes in Arabica caused light nodes to fail to sync. Restarting the light nodes resolved the issue, indicating that they resolve DNS once at startup and then use the same IP address indefinitely, ignoring DNS TTL.

Steps to Reproduce:

  1. Change the DNS entries for nodes.
  2. Observe light nodes failing to sync.
  3. Restart the light nodes and observe they can sync.

Suspected Cause: Light nodes resolve DNS entries only once at startup and continue using the same IP address without respecting the TTL. This affects both:

  • DA nodes connecting to other DA nodes.
  • DA Bridge nodes connecting to consensus nodes.

Relevant Code:

  • multiaddr DNS resolution: I could not find the relevant code.
  • --core.ip DNS resolution: https://github.com/celestiaorg/celestia-node/blob/f98d632818d566c7b4fd995b0f4bdc6443a7ed06/nodebuilder/core/config.go#L47

Potential Fix:

  1. Periodically re-resolve DNS entries based on the TTL.
  2. Update active connections if the resolved IP address changes.

Repositories Potentially Needing Changes:

Impact: Not respecting DNS TTL can lead to connectivity and sync issues, affecting network reliability.

Request for Assistance:

  1. Identify where DNS resolution is handled in the codebase and dependencies.
  2. Implement periodic DNS resolution based on TTL.
  3. Test changes to ensure nodes dynamically update connections based on DNS updates.

smuu avatar Jul 17 '24 12:07 smuu

I think the simplest fix here is to remove https://github.com/celestiaorg/celestia-node/blob/main/libs/utils/address.go#L40, which resolves the IP a single time on start, and instead let clients use the domain as passed in, relying on the infra of the internet to work.

Unless there was a good reason we HAD to resolve the IP?

ramin avatar Jul 17 '24 12:07 ramin

> I think the simplest fix here is to remove https://github.com/celestiaorg/celestia-node/blob/main/libs/utils/address.go#L40, which resolves the IP a single time on start, and instead let clients use the domain as passed in, relying on the infra of the internet to work.
>
> Unless there was a good reason we HAD to resolve the IP?

If I understand the code correctly, this would only resolve one part of the issue: the connection between the DA BN and the consensus node. From my understanding, this code is not called when resolving the DNS in a multiaddr.

smuu avatar Jul 17 '24 12:07 smuu

One workaround for this issue would be to recreate the connection once it fails after the IP address changes. This way, we don't need to add support to handle the DNS TTL, and the node would request the new IP address from the DNS server.

smuu avatar Jul 22 '24 13:07 smuu

What's the status on this, @ramin @smuu?

renaynay avatar Jul 29 '24 09:07 renaynay

Fixed in #3624

cristaloleg avatar Mar 13 '25 10:03 cristaloleg