Collect & report metrics
What & Why
To measure the success (or failure) of the Hydra Head project and improve continuously, we need to know how many Hydra Heads are opened, how long they are used, how many UTXOs are moved into / out of a Head, etc. Most of this information is publicly available and can be derived by observing the main chain. The remainder (e.g. transaction sizes & number of UTXOs in a Head) will be collected from within the hydra-node and will be opt-out once we reach mainnet maturity.
TBD
- Detail what we want to collect
- Scope out reporting infrastructure
- Is this also useful for implementing watch-tower functionality, and thus making custodial Hydra Heads more "trustworthy" when they provide this telemetry to their users (or watchtowers)?
Tasks
- [ ] Chain observer tracking Head transactions and aggregating Head information
- [ ] Explorer service using Head information
With a stateless "chain observer" available, we could host a simple "Hydra Head Explorer" service online that would show and track the state of heads running on some chain.
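To make the "aggregating Head information" task a bit more concrete, here is a minimal sketch of how an explorer could fold observed Head transactions into per-Head state. The observation tuple, the `HeadInfo` fields, the status names and the `STATUS_AFTER` mapping are all assumptions made for illustration; they are not the actual chain observer output.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical observation shape: (head_id, tx_kind, slot, utxo_count).
Observation = Tuple[str, str, int, int]

# Assumed mapping from an observed Head transaction kind to the Head status.
STATUS_AFTER = {
    "Init": "Initializing",
    "CollectCom": "Open",
    "Close": "Closed",
    "Fanout": "Finalized",
    "Abort": "Aborted",
}


@dataclass
class HeadInfo:
    status: str = "Unknown"
    opened_at: Optional[int] = None   # slot of the CollectCom transaction
    closed_at: Optional[int] = None   # slot of the Close transaction
    utxo_in: int = 0                  # UTxOs committed into the Head
    utxo_out: int = 0                 # UTxOs fanned out of the Head


def aggregate(observations: Iterable[Observation]) -> Dict[str, HeadInfo]:
    """Fold a stream of observations into one HeadInfo per head id."""
    heads: Dict[str, HeadInfo] = {}
    for head_id, kind, slot, utxo_count in observations:
        info = heads.setdefault(head_id, HeadInfo())
        if kind in STATUS_AFTER:
            info.status = STATUS_AFTER[kind]
        if kind == "Commit":
            info.utxo_in += utxo_count
        elif kind == "CollectCom":
            info.opened_at = slot
        elif kind == "Close":
            info.closed_at = slot
        elif kind == "Fanout":
            info.utxo_out += utxo_count
    return heads


def open_duration(info: HeadInfo) -> Optional[int]:
    """Slots between opening and closing, or None if not yet known."""
    if info.opened_at is None or info.closed_at is None:
        return None
    return info.closed_at - info.opened_at
```

Since `aggregate` only depends on the observations it is given, the explorer could rebuild its view from scratch by replaying the chain, which fits the "stateless" observer idea.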
Couple of basic ideas:
- What are interesting metrics to collect off-chain?
- We already publish Prometheus metrics inside the hydra-node, so we could simply add a sidecar that scrapes them and sends the data to a public Grafana Cloud instance
- The other part could be handled by observing the chain
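As a rough illustration of the sidecar idea, the sketch below parses the Prometheus text exposition format into a flat dictionary and wraps it in a JSON report. The metric names in the usage example, the `node` field and the report shape are hypothetical. Actually fetching the metrics and pushing the report to a collector is deliberately left out.

```python
import json
import re
from typing import Dict

# One sample line: name, optional {labels}, value, optional timestamp.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?P<labels>\{.*\})?"
    r"\s+(?P<value>\S+)"
    r"(?:\s+-?\d+)?$"
)


def parse_exposition(text: str) -> Dict[str, float]:
    """Parse Prometheus text format, skipping comments and unknown lines."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, HELP or TYPE comment
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        key = match.group("name") + (match.group("labels") or "")
        samples[key] = float(match.group("value"))
    return samples


def build_report(node_id: str, text: str) -> str:
    """Serialise the scraped samples into a (hypothetical) JSON report."""
    return json.dumps(
        {"node": node_id, "samples": parse_exposition(text)}, sort_keys=True
    )


if __name__ == "__main__":
    # Hypothetical scrape result; real metric names may differ.
    scraped = "# TYPE example_confirmed_tx counter\nexample_confirmed_tx 42\n"
    print(build_report("node-1", scraped))
```

In practice an off-the-shelf agent would probably do this job better; the point is only that the sidecar needs no changes to the hydra-node itself.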
I have set up and used Jaeger and Zipkin in the past, including inside Haskell apps. Having a way to track the processing of user requests across a distributed system is invaluable for understanding its behaviour.
Looking at https://github.com/ethercrow/opentelemetry-haskell which provides support for traces. Someone pointed me at https://opentelemetry.io/docs/concepts/data-collection/ which provides a conceptual framework for all kinds of "observability" data collection. In particular, OpenTelemetry (formed by merging OpenTracing and OpenCensus) defines standards for interoperability between various kinds of services, allowing for example Prometheus metrics, logs and traces to be collected and exported to some other service.
We currently expose the following metrics in the node:
- number of events
- number of requested txs
- number of confirmed txs
- tx confirmation time histogram
Handling and possibly tuning snapshot size is important for the protocol, so we should add:
- number of snapshots
- number of tx/snapshot
- snapshot confirmation time
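A minimal sketch of how these three snapshot metrics could be derived from per-snapshot records. The `Snapshot` record, its field names and the bucket bounds are assumptions for illustration. The histogram uses cumulative buckets (each bucket counts all values less than or equal to its upper bound), which is how Prometheus histograms are laid out.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Snapshot:
    # Hypothetical record of one confirmed snapshot.
    number: int
    tx_count: int
    requested_at: float  # seconds
    confirmed_at: float  # seconds


def cumulative_histogram(
    values: Sequence[float], upper_bounds: Sequence[float]
) -> Dict[str, int]:
    """Count values per cumulative bucket, plus a catch-all +Inf bucket."""
    bounds = sorted(upper_bounds)
    buckets = {str(b): 0 for b in bounds}
    buckets["+Inf"] = 0
    for value in values:
        for bound in bounds:
            if value <= bound:
                buckets[str(bound)] += 1
        buckets["+Inf"] += 1
    return buckets


def snapshot_metrics(
    snapshots: Sequence[Snapshot],
    upper_bounds: Sequence[float] = (0.1, 0.5, 1.0),
) -> dict:
    """Number of snapshots, average txs per snapshot, confirmation times."""
    times = [s.confirmed_at - s.requested_at for s in snapshots]
    total_txs = sum(s.tx_count for s in snapshots)
    return {
        "snapshots_total": len(snapshots),
        "txs_per_snapshot_avg": total_txs / len(snapshots) if snapshots else 0.0,
        "confirmation_time_buckets": cumulative_histogram(times, upper_bounds),
    }
```

Watching the average number of transactions per snapshot next to the confirmation time buckets should show whether larger snapshots actually cost latency.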
Also:
- event queue length, to track possible congestion/loopholes
- system-level resources (CPU, RAM, network traffic)
- number of UTxOs in the internal ledger
Traces could be an interesting addition, to analyse the trace generated by a NewTx coming from a client and how it spreads across the network until the transaction becomes confirmed. This would be helpful in particular to understand the behaviour of the network if/when we move away from a fully connected network to something more dynamic or less densely connected, with routing between the nodes. Not sure if it's worthwhile to do it now though.
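To illustrate what such a trace would give us, here is a small sketch that does not use any tracing library: every span carries the trace id created when the client submits the NewTx, so spans recorded on different nodes can be joined afterwards. The `Span` fields, the `Recorder` class and the span names in the usage example are hypothetical; a real setup would rely on OpenTelemetry's own context propagation.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by all spans caused by one NewTx
    span_id: str
    parent_id: Optional[str]  # None for the root span
    node: str                 # node that did the work
    name: str                 # e.g. the message being handled
    start: float              # seconds
    end: float                # seconds


class Recorder:
    """Stand-in for a trace collector: just accumulates spans."""

    def __init__(self) -> None:
        self.spans: List[Span] = []

    def record(self, trace_id: str, parent_id: Optional[str], node: str,
               name: str, start: float, end: float) -> Span:
        span = Span(trace_id, uuid.uuid4().hex[:16], parent_id,
                    node, name, start, end)
        self.spans.append(span)
        return span


def new_trace_id() -> str:
    return uuid.uuid4().hex


def end_to_end_latency(spans: List[Span], trace_id: str) -> float:
    """Time from the first span start to the last span end of one trace."""
    own = [s for s in spans if s.trace_id == trace_id]
    return max(s.end for s in own) - min(s.start for s in own)


def nodes_reached(spans: List[Span], trace_id: str) -> List[str]:
    """Sorted list of nodes that handled anything belonging to the trace."""
    return sorted({s.node for s in spans if s.trace_id == trace_id})


if __name__ == "__main__":
    rec = Recorder()
    tid = new_trace_id()
    root = rec.record(tid, None, "alice", "NewTx", 0.0, 0.25)
    rec.record(tid, root.span_id, "bob", "ReqTx", 0.5, 0.75)
    print(nodes_reached(rec.spans, tid), end_to_end_latency(rec.spans, tid))
```

With routing between nodes, the `parent_id` links would additionally show which path a transaction took through the network.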
Tasks for this feature:
- [ ] set up a central collection host/system with authenticated access
- [ ] deploy an OpenTelemetry sidecar instead of a Prometheus server within the Hydra stack
- [ ] configure OpenTelemetry to send metrics to the central host (with certificate)
- opting out simply means not deploying the sidecar