UnboundedChannelPersistentlyLarge

Open stakeworks opened this issue 1 year ago • 3 comments

I'm getting these alerts on both Darwinia & Crab nodes:

  • alertname = UnboundedChannelPersistentlyLarge
  • chain = crab2
  • entity = mpsc_import_notification_stream
  • instance = localhost:19615
  • job = crab-collator
  • monitor = CMN02
  • severity = warning

Annotations:

  • message = Channel mpsc_import_notification_stream on node localhost:19615 contains more than 200 items for more than 5 minutes. Node might be frozen.

Ubuntu 20.04 & 22.04
Binary: 6.3.4-e9430a36653

ExecStart=/darwinia \
  --collator \
  --chain=crab \
  --base-path /base-path/ \
  --name 'StakeWorks | Crab | CMN02' \
  --execution wasm \
  --prometheus-port 19615 \
  --prometheus-external \
  --listen-addr /ip4/xx.xx.xx.xx/tcp/30313/ws \
  --listen-addr /ip6/xx:xx:xx:xx::1/tcp/30313/ws \
  -- \
  --execution wasm \
  --chain=kusama \
  --base-path /base-path/ \
  --sync=warp \
  --state-pruning 1000 \
  --blocks-pruning 1000 \
  --out-peers 15 \
  --in-peers 35

stakeworks, Jul 21 '23 10:07

Can you try v6.4.0?

aurexav, Sep 14 '23 10:09

I've lowered the threshold of the UnboundedChannelPersistentlyLarge alert from 750 to 200 (the normal value). The nodes are already running v6.4.0. I'll let you know if the alert is triggered again.

stakeworks, Sep 14 '23 13:09

The alert is also triggered on v6.4.0, but so far only on Crab2. This is the alert rule:

  - alert: UnboundedChannelPersistentlyLarge
    expr: '(
        (substrate_unbounded_channel_len{action = "send"} -
            ignoring(action) substrate_unbounded_channel_len{action = "received"})
        or on(instance) substrate_unbounded_channel_len{action = "send"}
    ) >= 200'
    for: 5m
    labels:
      severity: warning
    annotations:
      message: 'Channel {{ $labels.entity }} on node {{ $labels.instance }} contains more than 200 items for more than 5 minutes. Node might be frozen.'

I will raise the `for` duration from 5 to 10 minutes and monitor what happens.

Update 19-9: The alert was still triggered with a duration of 10 minutes. I changed the duration back to 5 minutes and raised the item threshold from 200 to 500. There have been no alerts for a few days now.
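For reference, a sketch of what the rule would look like after that last change. It assumes that only the threshold in the expression and the number in the message text were edited; the updated message wording is an assumption, and everything else is copied from the rule above.

```yaml
  - alert: UnboundedChannelPersistentlyLarge
    # Backlog = items sent minus items received on the same channel.
    # The "or" branch falls back to the send series when the subtraction
    # yields no result for an instance.
    expr: '(
        (substrate_unbounded_channel_len{action = "send"} -
            ignoring(action) substrate_unbounded_channel_len{action = "received"})
        or on(instance) substrate_unbounded_channel_len{action = "send"}
    ) >= 500'
    # Duration set back to 5 minutes, as described in the update.
    for: 5m
    labels:
      severity: warning
    annotations:
      # Assumed wording: the original message with 200 replaced by 500.
      message: 'Channel {{ $labels.entity }} on node {{ $labels.instance }} contains more than 500 items for more than 5 minutes. Node might be frozen.'
```

A higher threshold only silences the warning for smaller backlogs. If the backlog in mpsc_import_notification_stream keeps growing on Crab2, the underlying cause still needs to be investigated.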

stakeworks, Sep 14 '23 17:09