
Add akka metrics grafana dashboard

Open GoutamVerma opened this issue 2 years ago • 2 comments

  • Channels (4 panels) channel(akka)

  • Register (4 panels) register(akka)

  • Peers (4 panels) peers(akka)

GoutamVerma avatar Jul 19 '22 13:07 GoutamVerma

The akka metrics we're interested in are:

  • akka.group.time-in-mailbox -> this is the time spent before handling a message, it should always stay small
  • akka.group.processing-time -> this is the time it takes to handle a message, it should also always stay small

You'll see that these metrics are emitted for every actor in the system, by using the group tag. You should filter this metric for specific group values depending on what actor you want to monitor. The actors we want to monitor for this first version are:

  • the Peer actor: group = eclair-node/user/SimpleSupervisor/Switchboard/Peer
  • the Channel actor: group = eclair-node/user/SimpleSupervisor/Switchboard/Peer/Channel
  • the Register actor: group = eclair-node/user/SimpleSupervisor/Register
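As a rough sketch of the filtering described above, the snippet below builds one PromQL selector per metric and actor. The group values are copied from the list. The Prometheus-side metric names are assumptions: they take the usual `_seconds` histogram form with a `_bucket` series, and may differ depending on how Kamon is configured to export them.

```python
# Group tag values copied from the list above.
ACTOR_GROUPS = {
    "Peer": "eclair-node/user/SimpleSupervisor/Switchboard/Peer",
    "Channel": "eclair-node/user/SimpleSupervisor/Switchboard/Peer/Channel",
    "Register": "eclair-node/user/SimpleSupervisor/Register",
}

# ASSUMPTION: Prometheus names of the two Kamon metrics (dots and dashes
# turned into underscores, "_seconds" unit suffix, histogram "_bucket" series).
METRICS = {
    "time-in-mailbox": "akka_group_time_in_mailbox_seconds_bucket",
    "processing-time": "akka_group_processing_time_seconds_bucket",
}


def selector(metric: str, actor: str) -> str:
    """Return a PromQL selector restricted to a single actor's group label."""
    return f'{METRICS[metric]}{{group="{ACTOR_GROUPS[actor]}"}}'


def all_selectors() -> dict:
    """Return one selector per (metric, actor) pair: 2 metrics x 3 actors."""
    return {(m, a): selector(m, a) for m in METRICS for a in ACTOR_GROUPS}


if __name__ == "__main__":
    for key, value in all_selectors().items():
        print(key, value)
```

Because `group="..."` is an exact-match filter, the Peer selector does not also pick up the Peer/Channel series, even though the two paths share a prefix.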

Here is a screenshot of what we currently have with Kamon for the Peer actor:

[screenshot: akka-metrics]

Notice that for each of the metrics, we create two graphs to display it in two different ways:

  • a heatmap summarizing the distribution of values
  • a percentile view (you will need to learn how to query for percentiles in prometheus and grafana, which is an important exercise - just search online for documentation and tutorials on how to do that)
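For the percentile view, Prometheus usually estimates percentiles from histogram buckets with `histogram_quantile`, for example `histogram_quantile(0.95, sum by (le) (rate(<metric>_bucket{group="..."}[5m])))`, where `<metric>_bucket` stands for whatever name the exporter gives the bucket series. The Python sketch below illustrates the kind of estimate that function makes: find the bucket containing the target rank, then interpolate linearly inside it. It is a simplified illustration with made-up bucket counts, not Prometheus's actual code.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total),
    mirroring the `le` label of a Prometheus histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies in the +Inf bucket: the best available
                # answer is the highest finite upper bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation inside the bucket containing the rank.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


if __name__ == "__main__":
    # Hypothetical time-in-mailbox buckets (seconds), cumulative counts.
    example = [(0.001, 50), (0.005, 90), (0.01, 98), (math.inf, 100)]
    for q in (0.5, 0.95, 0.99):
        print(q, histogram_quantile(q, example))
```

The estimate is only as precise as the bucket boundaries, so two percentile panels can differ noticeably if the exporter's buckets are coarse around the values of interest.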

t-bast avatar Jul 29 '22 08:07 t-bast

  • a heatmap summarizing the distribution of values

I have created two heatmaps for each actor (time-in-mailbox and processing-time) and added them to the PR.

  • a percentile view (you will need to learn how to query for percentiles in prometheus and grafana, which is an important exercise - just search online for documentation and tutorials on how to do that)

To view percentiles in Grafana and Prometheus, PromQL provides the quantile_over_time function. I used it in the query below to generate the histogram and timeseries panels.

Query: quantile_over_time(0.95, rate(akka_group_time_in_mailbox_seconds_count{group="eclairnode/user/SimpleSupervisor/Switchboard/Peer/Channel"}[5m])[$__range:5m])
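To make clearer what this query returns, here is a small Python sketch of the calculation `quantile_over_time` performs on the samples in its window: sort them, then interpolate linearly between the two nearest ranks. The sample values are invented for illustration. Note that the inner expression is `rate(..._count[5m])`, so the samples are messages handled per second. As far as I can tell, the result is therefore the 95th percentile of the message rate over the dashboard range, which is a different quantity from a percentile of time spent in the mailbox computed from the `_bucket` series with `histogram_quantile`.

```python
def quantile_over_time(q, samples):
    """q-quantile of the samples in a window, interpolating between ranks."""
    if not samples:
        raise ValueError("no samples in the window")
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = int(rank)  # floor, since rank is non-negative
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


if __name__ == "__main__":
    # Hypothetical per-second message rates, one sample per 5m subquery step.
    rates = [10.0, 12.0, 11.0, 50.0, 13.0]
    print(quantile_over_time(0.95, rates))
    print(quantile_over_time(0.5, rates))
```

With only a handful of samples in the window, a single spike dominates the upper quantiles, which is worth keeping in mind when reading the panels over short dashboard ranges.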

Histogram: [screenshot: histogram]

Timeseries: [screenshot: timeseries]

More info about percentile in grafana and prometheus

GoutamVerma avatar Aug 09 '22 07:08 GoutamVerma

Codecov Report

Merging #2347 (aa5a787) into master (e1dc358) will increase coverage by 0.19%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2347      +/-   ##
==========================================
+ Coverage   84.68%   84.88%   +0.19%     
==========================================
  Files         194      198       +4     
  Lines       14650    15277     +627     
  Branches      613      640      +27     
==========================================
+ Hits        12407    12968     +561     
- Misses       2243     2309      +66     
Impacted Files | Coverage Δ
.../scala/fr/acinq/eclair/payment/PaymentPacket.scala | 73.03% <0.00%> (-18.28%) ↓
...a/fr/acinq/eclair/wire/protocol/PaymentOnion.scala | 92.98% <0.00%> (-6.01%) ↓
.../fr/acinq/eclair/wire/protocol/RouteBlinding.scala | 96.00% <0.00%> (-4.00%) ↓
...q/eclair/wire/protocol/LightningMessageTypes.scala | 94.64% <0.00%> (-2.98%) ↓
...main/scala/fr/acinq/eclair/db/jdbc/JdbcUtils.scala | 88.23% <0.00%> (-2.95%) ↓
...la/fr/acinq/eclair/channel/fsm/ErrorHandlers.scala | 80.39% <0.00%> (-1.25%) ↓
...scala/fr/acinq/eclair/router/BalanceEstimate.scala | 98.91% <0.00%> (-1.09%) ↓
...la/fr/acinq/eclair/wire/protocol/OfferCodecs.scala | 96.82% <0.00%> (-0.80%) ↓
...r/acinq/eclair/payment/send/PaymentLifecycle.scala | 86.93% <0.00%> (-0.79%) ↓
...main/scala/fr/acinq/eclair/io/PeerConnection.scala | 86.02% <0.00%> (-0.69%) ↓
... and 44 more

codecov-commenter avatar Aug 17 '22 05:08 codecov-commenter

Let's merge this as-is, we can always improve it later. I strongly encourage node operators who want to experiment with these new monitoring dashboards to challenge the values they see and contribute back by enriching the graphs or fixing what may be incorrect.

t-bast avatar Aug 22 '22 13:08 t-bast