celestia-node

[Feature Request]: Unified Telemetry with Push/Pull Model Support

Open aWN4Y25pa2EK opened this issue 4 months ago • 3 comments

Implementation ideas

celestia-node currently supports push-only telemetry for application metrics (OTLP, enabled with --metrics) and pull-only telemetry for P2P metrics (Prometheus, enabled with --p2p.metrics). It would be great to have both models available for all metrics, so that operators can choose whichever one fits their telemetry setup.

For example, an operator who already runs a local Prometheus instance might prefer the pull model.

Limitations

  • Application metrics cannot be scraped directly (no native Prometheus endpoint)
  • Fragmented metrics across different ports
  • Operators must choose between incomplete pull OR infrastructure-heavy push

Benefits

  • Operational Flexibility
  • Reduced Complexity
  • Consistency

So to recap:

  1. unified telemetry --metrics for everything
  2. support for both the pull and push models, e.g. --metrics --metrics.pull or --metrics --metrics.push

Some other example flag combinations:

# The existing --metrics flag could default to the `pull` model:
celestia light start --metrics

This is simpler: fewer flags are required and telemetry is unified. An explicit --metrics --metrics.pull could also be supported, to stay logically consistent with the push model.

# Push is more complex and requires more parameters, so it could be:
celestia light start --metrics --metrics.push --metrics.tls=false --metrics.endpoint localhost:4318
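
To make the proposed flag semantics concrete, here is a minimal Go sketch of how the combinations above could be resolved into one telemetry mode. It uses only the standard library. The flags --metrics.push and --metrics.pull are the ones proposed in this issue, not flags celestia-node is known to accept today, and the `Config` type, the `ResolveConfig` function, the validation rules, and the TLS default of true are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// Mode is the telemetry export model chosen by the operator.
type Mode string

const (
	ModeOff  Mode = "off"
	ModePull Mode = "pull"
	ModePush Mode = "push"
)

// Config is a hypothetical resolved telemetry configuration.
type Config struct {
	Mode     Mode
	Endpoint string // collector address, only meaningful for push
	TLS      bool   // only meaningful for push
}

// ResolveConfig parses the proposed flags and applies the rules sketched in
// this issue: --metrics alone means pull, push needs an endpoint, and push
// and pull cannot be combined. The rules themselves are assumptions.
func ResolveConfig(args []string) (Config, error) {
	fs := flag.NewFlagSet("celestia", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	metrics := fs.Bool("metrics", false, "enable unified telemetry")
	pull := fs.Bool("metrics.pull", false, "expose metrics for scraping")
	push := fs.Bool("metrics.push", false, "push metrics to a collector")
	endpoint := fs.String("metrics.endpoint", "", "collector address (push only)")
	tls := fs.Bool("metrics.tls", true, "use TLS when pushing (assumed default)")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}

	switch {
	case !*metrics && (*pull || *push):
		return Config{}, errors.New("--metrics.pull and --metrics.push require --metrics")
	case !*metrics:
		return Config{Mode: ModeOff}, nil
	case *pull && *push:
		return Config{}, errors.New("--metrics.pull and --metrics.push are mutually exclusive")
	case *push && *endpoint == "":
		return Config{}, errors.New("--metrics.push requires --metrics.endpoint")
	case *push:
		return Config{Mode: ModePush, Endpoint: *endpoint, TLS: *tls}, nil
	default:
		// --metrics alone, or with an explicit --metrics.pull.
		return Config{Mode: ModePull}, nil
	}
}

func main() {
	// The two example invocations from this issue, minus "celestia light start".
	for _, args := range [][]string{
		{"--metrics"},
		{"--metrics", "--metrics.push", "--metrics.tls=false", "--metrics.endpoint", "localhost:4318"},
	} {
		cfg, err := ResolveConfig(args)
		fmt.Printf("%v -> %+v (err: %v)\n", args, cfg, err)
	}
}
```

Rejecting conflicting combinations at startup keeps a misconfigured node from silently exporting nothing, but whether that belongs in flag parsing or elsewhere is a design choice left open here.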

Possible performance gains

The pull model provides safer resource consumption with lower operational risk. We should benchmark both models to measure the exact percentage overhead difference between them.

@pippokr /// @tty47

aWN4Y25pa2EK avatar Aug 07 '25 22:08 aWN4Y25pa2EK

The pull model would require the metrics collector to know the location of the node, which is not possible most of the time. For the push approach there is a WIP PR to implement it: https://github.com/celestiaorg/celestia-node/pull/3702

walldiss avatar Aug 12 '25 15:08 walldiss

By location, are you referring to the public IP of the node, as seen by telemetry scraping jobs?

Theoretically yes, but this wouldn't be a problem if an FQDN is used. The idea is mostly to support both push and pull models, since push forces everyone to use OTel, which is itself an additional overlay and requires more computational resources for continuous packet processing.

The push model also makes it difficult to get fast confirmation of telemetry: by default it forces me to have an OTel upstream somewhere, otherwise there is no way to see what the node is exporting.
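
As an illustration of the "fast confirmation" point, here is a minimal Go sketch of what a pull endpoint gives an operator: a handler that renders a counter in the Prometheus text exposition format, which can be read directly without any upstream collector. It uses only the standard library and is not how celestia-node exposes metrics. The metric name `example_headers_synced_total`, the `headersSynced` counter, and the `scrape` helper are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// headersSynced is a hypothetical application-level counter.
var headersSynced atomic.Int64

// metricsHandler writes the counter in the Prometheus text exposition format.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintf(w, "# TYPE example_headers_synced_total counter\n")
	fmt.Fprintf(w, "example_headers_synced_total %d\n", headersSynced.Load())
}

// scrape calls the handler in-process, the way a scraper (or curl) would
// request /metrics, and returns the response body.
func scrape() string {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	metricsHandler(rec, req)
	return rec.Body.String()
}

func main() {
	headersSynced.Add(3)
	fmt.Print(scrape())
}
```

In a running node the handler would be registered on an HTTP server (for example with `http.Handle("/metrics", http.HandlerFunc(metricsHandler))`), so a plain HTTP GET is enough to see what is being exported.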

aWN4Y25pa2EK avatar Aug 16 '25 10:08 aWN4Y25pa2EK

I agree, this is annoying and something we should fix. The best solution would be to make libp2p add support for the push model. Warrants an issue on go-libp2p

Wondertan avatar Aug 19 '25 09:08 Wondertan