hydra Collect & report metrics

What & Why

To measure the success (or failure) of the Hydra Head project and improve continuously, we need to know how many Hydra Heads are opened, how long they are used, how many UTXOs are moved into and out of a Head, etc. Most of this information is publicly available and can be derived by observing the main chain. The remainder (e.g. transaction sizes and the number of UTXOs in a Head) will be collected from within the hydra-node and will be opt-out once we reach mainnet maturity.

TBD

Detail what we want to collect
Scope out reporting infrastructure
Is this also useful for implementing watch-tower functionality, and would it make custodial Hydra Heads more "trustworthy" when they provide this telemetry to their users (or watchtowers)?

Tasks

[ ] Chain observer tracking Head transactions and aggregating Head information
[ ] Explorer service using Head information
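
A minimal sketch of what the chain observer's aggregation could look like, assuming each observed Head transaction is reduced to a small record. The `HeadTx` and `HeadInfo` types, their fields, and the transaction kind strings are illustrative assumptions, not the hydra-node's actual data model; Python is used only to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class HeadTx:
    """One observed on-chain Head transaction (hypothetical shape)."""
    head_id: str
    kind: str            # assumed kinds: "Init", "Commit", "CollectCom", "Close", "Fanout"
    slot: int            # chain slot at which the transaction was observed
    utxo_count: int = 0  # UTxOs moved by this transaction, if any


@dataclass
class HeadInfo:
    """Aggregated information about a single Head."""
    opened_at: Optional[int] = None
    closed_at: Optional[int] = None
    utxo_in: int = 0
    utxo_out: int = 0

    @property
    def open_duration(self) -> Optional[int]:
        # Only defined once both the opening and the closing were observed.
        if self.opened_at is None or self.closed_at is None:
            return None
        return self.closed_at - self.opened_at


def aggregate(txs: Iterable[HeadTx]) -> Dict[str, HeadInfo]:
    """Fold a stream of observed Head transactions into per-Head information."""
    heads: Dict[str, HeadInfo] = {}
    for tx in txs:
        info = heads.setdefault(tx.head_id, HeadInfo())
        if tx.kind == "Commit":
            info.utxo_in += tx.utxo_count
        elif tx.kind == "CollectCom":
            info.opened_at = tx.slot
        elif tx.kind == "Close":
            info.closed_at = tx.slot
        elif tx.kind == "Fanout":
            info.utxo_out += tx.utxo_count
    return heads
```

Because the result is a pure fold over observed transactions, an explorer service could rebuild it at any time by replaying the chain, which fits the idea of keeping the observer itself stateless.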

Jan 30 '22 17:01 ch1bo

With a stateless "chain observer" available, we could host a simple "Hydra Head Explorer" service online that would show and track the state of heads running on some chain?

Feb 25 '22 11:02 abailly-iohk

Couple of basic ideas:

What are interesting metrics to collect off-chain?
We already publish Prometheus metrics inside the hydra-node; we could simply add a sidecar that scrapes them and sends the data to a public Grafana Cloud instance
Other part could be handled by observing the chain
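
To make the sidecar idea concrete, here is a rough sketch of its core step: parsing the Prometheus text exposition format and packaging the samples for forwarding. It assumes samples carry no trailing timestamp, and the function names, the payload shape and the metric names used in the usage note are hypothetical. The actual scraping and sending over HTTP is left out.

```python
import json
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {"name{labels}": value}.

    Assumes each sample line is "<name or name{labels}> <value>" with no
    timestamp. Comment lines (# HELP, # TYPE) and blank lines are skipped.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.rsplit(None, 1)  # split on the last run of whitespace
        if len(parts) != 2:
            continue  # ignore malformed lines
        name, value = parts
        samples[name] = float(value)
    return samples


def to_payload(node_id: str, samples: Dict[str, float]) -> str:
    """Serialise the scraped samples as JSON (hypothetical payload shape)."""
    return json.dumps({"node": node_id, "samples": samples}, sort_keys=True)
```

For example, `parse_metrics("hydra_head_confirmed_tx 42\n")` yields `{"hydra_head_confirmed_tx": 42.0}`, where the metric name is only an illustration of what the node might expose.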

Apr 07 '22 12:04 abailly-iohk

I have set up and used Jaeger and Zipkin in the past, including inside Haskell apps. Having a way to track the processing of user requests across a distributed system is invaluable for understanding its behaviour.

Looking at https://github.com/ethercrow/opentelemetry-haskell, which provides support for traces. Someone pointed me at https://opentelemetry.io/docs/concepts/data-collection/, which provides a conceptual framework for all kinds of "observability" data collection. In particular, OpenTelemetry (formed by merging OpenTracing and OpenCensus) defines standards for interoperability between various kinds of services, making it possible, for example, to collect Prometheus metrics, logs and traces and export them to some other service.

We currently expose the following metrics in the node:

number of events
number of requested txs
number of confirmed txs
tx confirmation time histogram

Handling and possibly tuning snapshot size is important for the protocol, so we should add:

number of snapshots
number of tx/snapshot
snapshot confirmation time
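
A sketch of how the three snapshot metrics above could be tracked together. The `SnapshotMetrics` class and the bucket bounds are assumptions for illustration only. Bucket counts here are per bucket (not cumulative as in Prometheus), with one extra overflow bucket at the end.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List

# Assumed upper bounds (in ms) for the confirmation time histogram.
BUCKETS_MS = [10.0, 50.0, 100.0, 500.0, 1000.0]


@dataclass
class SnapshotMetrics:
    snapshots: int = 0  # number of snapshots
    txs: int = 0        # total transactions across all snapshots
    # One counter per bucket, plus a final overflow bucket.
    bucket_counts: List[int] = field(
        default_factory=lambda: [0] * (len(BUCKETS_MS) + 1)
    )

    def observe(self, tx_count: int, confirmation_ms: float) -> None:
        """Record one confirmed snapshot."""
        self.snapshots += 1
        self.txs += tx_count
        # Index of the first bucket whose upper bound is >= the observation.
        self.bucket_counts[bisect_left(BUCKETS_MS, confirmation_ms)] += 1

    @property
    def txs_per_snapshot(self) -> float:
        """Average number of transactions per snapshot."""
        return self.txs / self.snapshots if self.snapshots else 0.0
```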

Also:

event queue length, to track possible congestion/loopholes
system-level resources (CPU, RAM, network traffic)
number of UTxOs in the internal ledger

Traces could be an interesting addition: they would let us follow a NewTx coming from a client as it spreads across the network until the transaction is confirmed. This would be particularly helpful for understanding the behaviour of the network if/when we move away from a fully connected network to something more dynamic or less densely connected, with routing between the nodes. Not sure if it's worthwhile to do it now though.
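
As a rough illustration of what such traces would give us, the sketch below groups spans by a shared trace id and computes the end-to-end latency of each transaction across nodes. The `Span` record, its fields and the span names are hypothetical; a real setup would rely on an OpenTelemetry library to create and propagate the trace context rather than on hand-rolled records.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Span:
    """One unit of work on one node (hypothetical shape)."""
    trace_id: str  # shared by every span caused by the same NewTx
    node: str      # node on which the work happened
    name: str      # illustrative step name, e.g. "NewTx"
    start_ms: int
    end_ms: int


def end_to_end_ms(spans: Iterable[Span]) -> Dict[str, int]:
    """Latency per trace: latest span end minus earliest span start."""
    first: Dict[str, int] = {}
    last: Dict[str, int] = {}
    for s in spans:
        first[s.trace_id] = min(first.get(s.trace_id, s.start_ms), s.start_ms)
        last[s.trace_id] = max(last.get(s.trace_id, s.end_ms), s.end_ms)
    return {trace_id: last[trace_id] - first[trace_id] for trace_id in first}
```

This assumes the nodes' clocks are comparable. In a less densely connected network, the per-node spans would additionally show which route a transaction took.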

Tasks for this feature:

[ ] set up a central collection host/system with authenticated access
[ ] deploy an OpenTelemetry sidecar instead of a Prometheus server within the hydra stack
[ ] configure OpenTelemetry to send metrics to the central host (with certificate)
opting out simply means not deploying the sidecar

Apr 26 '22 07:04 abailly-iohk
