eclair
Add akka metrics grafana dashboard
- Channels (4 panels)
- Register (4 panels)
- Peers (4 panels)
The akka metrics we're interested in are:
- `akka.group.time-in-mailbox` -> this is the time spent before handling a message, it should always stay small
- `akka.group.processing-time` -> this is the time it takes to handle a message, it should also always stay small
You'll see that these metrics are emitted for every actor in the system and are distinguished by the `group` tag. You should filter each metric on specific `group` values depending on which actor you want to monitor. The actors we want to monitor for this first version are:
- the `Peer` actor: `group = eclair-node/user/SimpleSupervisor/Switchboard/Peer`
- the `Channel` actor: `group = eclair-node/user/SimpleSupervisor/Switchboard/Peer/Channel`
- the `Register` actor: `group = eclair-node/user/SimpleSupervisor/Register`
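As a rough illustration of the filtering described above, the sketch below builds one Prometheus selector per metric and actor. The `group` values come from the list above. The Prometheus metric names are assumptions: `akka_group_time_in_mailbox_seconds` is taken from the query used in this PR, while `akka_group_processing_time_seconds` is only inferred by analogy and should be checked against what your node actually exports. The helper names are hypothetical.

```python
# Sketch: one PromQL selector per (metric, actor) pair, filtered on the `group` tag.

# `group` values as listed in the issue.
ACTOR_GROUPS = {
    "Peer": "eclair-node/user/SimpleSupervisor/Switchboard/Peer",
    "Channel": "eclair-node/user/SimpleSupervisor/Switchboard/Peer/Channel",
    "Register": "eclair-node/user/SimpleSupervisor/Register",
}

# Assumed Prometheus names for the two akka metrics (verify on your own node).
METRICS = {
    "time-in-mailbox": "akka_group_time_in_mailbox_seconds",
    "processing-time": "akka_group_processing_time_seconds",
}


def selector(metric: str, actor: str, suffix: str = "_count") -> str:
    """Return a PromQL selector for one metric, restricted to one actor group."""
    return f'{METRICS[metric]}{suffix}{{group="{ACTOR_GROUPS[actor]}"}}'


def panel_selectors(suffix: str = "_count") -> dict:
    """Return selectors keyed by (actor, metric): 3 actors x 2 metrics."""
    return {
        (actor, metric): selector(metric, actor, suffix)
        for actor in ACTOR_GROUPS
        for metric in METRICS
    }


if __name__ == "__main__":
    for key, value in panel_selectors().items():
        print(key, value)
```

Each actor ends up with two selectors, and each selector feeds two panels (heatmap and percentile view), which matches the 4 panels per dashboard row.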
Here is a screenshot of what we currently have with Kamon for the `Peer` actor:
Notice that for each of the metrics, we create two graphs to display it in two different ways:
- a heatmap summarizing the distribution of values
- a percentile view (you will need to learn how to query for percentiles in prometheus and grafana, which is an important exercise - just search online for documentation and tutorials on how to do that)
> - a heatmap summarizing the distribution of values

I have created two heatmaps for each actor (time-in-mailbox and processing-time) and added them to this PR.

> - a percentile view (you will need to learn how to query for percentiles in prometheus and grafana, which is an important exercise - just search online for documentation and tutorials on how to do that)

To view percentiles in Grafana and Prometheus, PromQL provides the `quantile_over_time` function. I used it in the query below to generate the histogram and time series panels that follow.
Query: `quantile_over_time(0.95, rate(akka_group_time_in_mailbox_seconds_count{group="eclair-node/user/SimpleSupervisor/Switchboard/Peer/Channel"}[5m])[$__range:5m])`
Histogram:
Timeseries:
More info about percentile in grafana and prometheus
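For comparison, a common way to get latency percentiles from Prometheus histograms is `histogram_quantile` over the `_bucket` series, whereas the query above is built on the `_count` series. The sketch below only assembles such a query string. It assumes the exporter publishes `akka_group_time_in_mailbox_seconds_bucket` with an `le` label, which this thread does not confirm, and the function name is hypothetical.

```python
# Sketch: build a histogram_quantile query for one actor group.
# Assumes `<metric_base>_bucket` series with an `le` label exist (not confirmed here).


def percentile_query(metric_base: str, group: str, q: float = 0.95, window: str = "5m") -> str:
    """Return a PromQL expression estimating the q-quantile over `window`."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = f'{metric_base}_bucket{{group="{group}"}}'
    # Sum the per-bucket rates by `le` so the quantile covers all series in the group.
    return f"histogram_quantile({q}, sum by (le) (rate({buckets}[{window}])))"


if __name__ == "__main__":
    print(
        percentile_query(
            "akka_group_time_in_mailbox_seconds",
            "eclair-node/user/SimpleSupervisor/Switchboard/Peer/Channel",
        )
    )
```

If the bucket series are present, the resulting expression can be pasted into a Grafana time series panel and compared against the `quantile_over_time` panels.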
Codecov Report
Merging #2347 (aa5a787) into master (e1dc358) will increase coverage by 0.19%. The diff coverage is n/a.
@@ Coverage Diff @@
## master #2347 +/- ##
==========================================
+ Coverage 84.68% 84.88% +0.19%
==========================================
Files 194 198 +4
Lines 14650 15277 +627
Branches 613 640 +27
==========================================
+ Hits 12407 12968 +561
- Misses 2243 2309 +66
Impacted Files | Coverage Δ | |
---|---|---|
.../scala/fr/acinq/eclair/payment/PaymentPacket.scala | 73.03% <0.00%> (-18.28%) | :arrow_down: |
...a/fr/acinq/eclair/wire/protocol/PaymentOnion.scala | 92.98% <0.00%> (-6.01%) | :arrow_down: |
.../fr/acinq/eclair/wire/protocol/RouteBlinding.scala | 96.00% <0.00%> (-4.00%) | :arrow_down: |
...q/eclair/wire/protocol/LightningMessageTypes.scala | 94.64% <0.00%> (-2.98%) | :arrow_down: |
...main/scala/fr/acinq/eclair/db/jdbc/JdbcUtils.scala | 88.23% <0.00%> (-2.95%) | :arrow_down: |
...la/fr/acinq/eclair/channel/fsm/ErrorHandlers.scala | 80.39% <0.00%> (-1.25%) | :arrow_down: |
...scala/fr/acinq/eclair/router/BalanceEstimate.scala | 98.91% <0.00%> (-1.09%) | :arrow_down: |
...la/fr/acinq/eclair/wire/protocol/OfferCodecs.scala | 96.82% <0.00%> (-0.80%) | :arrow_down: |
...r/acinq/eclair/payment/send/PaymentLifecycle.scala | 86.93% <0.00%> (-0.79%) | :arrow_down: |
...main/scala/fr/acinq/eclair/io/PeerConnection.scala | 86.02% <0.00%> (-0.69%) | :arrow_down: |
... and 44 more | | |
Let's merge this as-is, we can always improve it later. I strongly encourage node operators who want to experiment with these new monitoring dashboards to challenge the values they see and contribute back by enriching the graphs or fixing what may be incorrect.