[Feature Request]: Unified Telemetry with Push/Pull Model Support
Implementation ideas
celestia-node currently supports push-only telemetry for application metrics (OTLP, via --metrics) and pull-only telemetry for P2P metrics (Prometheus, via --p2p.metrics). It would be great to have both models available, so operators can choose whichever fits their operational setup at the telemetry level.
For example, there are scenarios where someone might prefer the pull model because they already operate a local Prometheus endpoint.
Limitations
- Application metrics cannot be scraped directly (no native Prometheus endpoint)
- Fragmented metrics across different ports
- Operators must choose between incomplete pull OR infrastructure-heavy push
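To make the first limitation concrete, here is a minimal sketch of what a scrapeable endpoint for application metrics could look like. It uses only the Go standard library and hand-renders gauges in the Prometheus text exposition format. The metric names, the `renderMetrics` and `metricsHandler` helpers, and the snapshot function are all hypothetical; this is not how celestia-node implements telemetry today.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
)

// renderMetrics formats gauge values in the Prometheus text exposition format.
// Names are sorted so the output is deterministic across scrapes.
func renderMetrics(gauges map[string]float64) string {
	names := make([]string, 0, len(gauges))
	for name := range gauges {
		names = append(names, name)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, name := range names {
		fmt.Fprintf(&b, "# TYPE %s gauge\n%s %g\n", name, name, gauges[name])
	}
	return b.String()
}

// metricsHandler serves a fresh snapshot of the gauges on every scrape (pull model).
func metricsHandler(snapshot func() map[string]float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprint(w, renderMetrics(snapshot()))
	})
}

func main() {
	// Hypothetical application metrics; real names would come from the node.
	snapshot := func() map[string]float64 {
		return map[string]float64{
			"app_headers_synced":  42,
			"app_peers_connected": 7,
		}
	}

	// Simulate a single scrape in-process instead of binding a real port.
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	metricsHandler(snapshot).ServeHTTP(rec, req)
	fmt.Print(rec.Body.String())
}
```

In a real node the handler would be mounted on a listening server (for example with `http.ListenAndServe`), ideally on the same port as the existing P2P metrics so that a single scrape job covers everything.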
Benefits
- Operational Flexibility
- Reduced Complexity
- Consistency
So to recap:
- unified telemetry: --metrics for everything
- support for pull and push models, e.g. --metrics --metrics.push or --metrics --metrics.pull
Some other example flag combinations:
# Existing logic could become by default a `pull` model:
celestia light start --metrics
Simpler: fewer flags required and unified telemetry. --metrics --metrics.pull could also be supported as an explicit input, to keep logical consistency with the push model.
# Push logic since it's more complex and requires more parameters by default could be:
celestia light start --metrics --metrics.push --metrics.tls=false --metrics.endpoint localhost:4318
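The flag combinations above could be resolved roughly as in the following sketch. It uses the standard library `flag` package rather than whatever CLI framework the node actually uses, and the `TelemetryConfig` type and `parseTelemetryFlags` function are hypothetical. Two rules in it are assumptions that the proposal does not spell out: push and pull are treated as mutually exclusive, and push is rejected without an endpoint.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// TelemetryConfig is a hypothetical resolved view of the proposed flags.
type TelemetryConfig struct {
	Enabled  bool
	Mode     string // "pull" or "push"
	Endpoint string // OTLP endpoint, only meaningful for push
	TLS      bool   // only meaningful for push
}

// parseTelemetryFlags resolves the proposed flags into a single mode.
// Plain --metrics defaults to pull, as suggested in the proposal.
func parseTelemetryFlags(args []string) (TelemetryConfig, error) {
	fs := flag.NewFlagSet("start", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch

	enabled := fs.Bool("metrics", false, "enable telemetry")
	push := fs.Bool("metrics.push", false, "export metrics via push")
	pull := fs.Bool("metrics.pull", false, "expose metrics for scraping")
	tls := fs.Bool("metrics.tls", true, "use TLS for the push exporter")
	endpoint := fs.String("metrics.endpoint", "", "push endpoint, host:port")

	if err := fs.Parse(args); err != nil {
		return TelemetryConfig{}, err
	}

	switch {
	case !*enabled:
		return TelemetryConfig{}, nil
	case *push && *pull:
		// Assumption: only one model is active at a time.
		return TelemetryConfig{}, errors.New("--metrics.push and --metrics.pull are mutually exclusive")
	case *push:
		// Assumption: push has no usable default endpoint.
		if *endpoint == "" {
			return TelemetryConfig{}, errors.New("--metrics.push requires --metrics.endpoint")
		}
		return TelemetryConfig{Enabled: true, Mode: "push", Endpoint: *endpoint, TLS: *tls}, nil
	default:
		// Either --metrics alone or --metrics --metrics.pull.
		return TelemetryConfig{Enabled: true, Mode: "pull"}, nil
	}
}

func main() {
	cfg, err := parseTelemetryFlags([]string{
		"--metrics", "--metrics.push", "--metrics.tls=false",
		"--metrics.endpoint", "localhost:4318",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Making pull the default keeps the common case to a single flag, while the more parameter-heavy push path has to be opted into explicitly.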
Possible performance gains
The pull model provides safer resource consumption with lower operational risk. We should benchmark both models to know precisely what the overhead difference (in %) is between them.
@pippokr /// @tty47
The pull model would require the metrics collector to know the location of the node, which is not possible most of the time. For the push approach there is a WIP PR to implement it: https://github.com/celestiaorg/celestia-node/pull/3702
By location, are you referring to the public IP of the node, in terms of telemetry scraping jobs?
Theoretically yes, but this wouldn't be a problem if an FQDN is used. The idea is mostly to support both push and pull models, since push forces everyone to use otel, which is itself an additional overlay and requires more computational resources for continuous packet processing.
The push model sometimes makes things difficult when I need fast confirmation of telemetry: by default it forces me to have an otel upstream somewhere, otherwise there is no way to see what the node is exporting.
I agree, this is annoying and something we should fix. The best solution would be to make libp2p add support for the push model. Warrants an issue on go-libp2p