
metrics: surface validator status, e.g. jailed

Open conorsch opened this issue 1 year ago • 6 comments

During Testnet 64, folks in Discord have reported an uptick in jailing rates. It's difficult to reason concretely about that observation because we don't have good visibility into jailing rates over time. We do emit metrics such as penumbra_stake_validators_jailed in pd, but we aren't visualizing those anywhere out of the box. We also try to log the status change via a span: https://github.com/penumbra-zone/penumbra/blob/v0.64.2/crates/core/component/stake/src/component.rs#L125-L135 but oddly that span label isn't returning results in our Testnet 64 logs, even though we have debug logging enabled for the validators we run.
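For reference, the pattern at the linked lines is span-based logging via the tracing crate; a rough sketch (the span and field names here are illustrative stand-ins, not a copy of the actual code):

```rust
use tracing::debug_span;

// Illustrative sketch of span-based state-change logging; the field
// names are hypothetical stand-ins for what component.rs records.
fn log_state_change(identity_key: &str, old_state: &str, new_state: &str) {
    let span = debug_span!(
        "validator_state_change",
        identity_key,
        old_state,
        new_state,
    );
    let _guard = span.enter();
    // Events emitted while the guard is live carry the span's fields,
    // which is what we'd expect to grep for in the Testnet 64 logs.
    tracing::debug!("validator changed state");
}
```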

conorsch avatar Feb 06 '24 18:02 conorsch

From current testnet:

❯ curl -s http://localhost:9000/metrics | rg ^penumbra_stake | rg -v missed_blocks
penumbra_stake_validators_inactive 363
penumbra_stake_validators_jailed -23
penumbra_stake_validators_disabled 29
penumbra_stake_validators_active 0
penumbra_stake_validators_defined 64

There are a few problems in there, such as active=0 and "jailed" being a negative value. Let's inspect the transition logic in the staking component and make sure gauges are incremented and decremented correctly.
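For context, the suspected pattern is transition-based bookkeeping along these lines (a sketch assuming the metrics crate's gauge handles; the helper and its arguments are hypothetical):

```rust
use metrics::gauge;

// Sketch of the transition-based bookkeeping under suspicion.
fn on_state_transition(old_gauge: &'static str, new_gauge: &'static str) {
    // Decrement the gauge for the state being left...
    gauge!(old_gauge).decrement(1.0);
    // ...and increment the gauge for the state being entered.
    // Gauges start at zero when pd boots, so any "leave" observed
    // without a matching prior "enter" drives the gauge negative,
    // and counts never reflect state from before the process started.
    gauge!(new_gauge).increment(1.0);
}

// e.g. a validator getting jailed:
// on_state_transition("penumbra_stake_validators_active",
//                     "penumbra_stake_validators_jailed");
```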

conorsch avatar Feb 16 '24 18:02 conorsch

Similarly, the missed blocks metric is not accurate:

❯ curl -s http://localhost:9000/metrics | rg ^penumbra_stake_missed_blocks | sort -k 2 -n -r | head -n 20
penumbra_stake_missed_blocks{identity_key="penumbravalid14e54ukuppxs3f6dsg6flx9tsmqcv363tvvzrdku8fgmagjd435fqexu7zs"} 148359350
penumbra_stake_missed_blocks{identity_key="penumbravalid1u60u8405jg8mlz0epmhqfslqqkjurx6l4pm8n3j0ves66sztwqrqctw4f2"} 137059677
penumbra_stake_missed_blocks{identity_key="penumbravalid1qrxgv0nqm8a5edffr5t5ykhjuchkmwkejsgfhtc38t2y0kf33qrsql244n"} 133699997
penumbra_stake_missed_blocks{identity_key="penumbravalid1g3wfpjkplhktcq5d3t3l5ud4u34qwh6kxacwrkd8uacvh5jlcqqqzqnhxt"} 130544894
penumbra_stake_missed_blocks{identity_key="penumbravalid1ms5kyy9qz0dggseq9ud9gl2wmlyl8c767hrw66ku35tksczsmv8qxlxxsd"} 127920260
penumbra_stake_missed_blocks{identity_key="penumbravalid1qqqvkxl2eewjl0f3kakcvryv4g6kq6ua8w8xk4g5j7n9pn50ycqstgekuv"} 119628111
penumbra_stake_missed_blocks{identity_key="penumbravalid1sqmemw9qeswm8fmvk8p8g2z49rhrxkyu09h9keqz5s6xkxh3tc8q9j32w0"} 99881573
penumbra_stake_missed_blocks{identity_key="penumbravalid15tu5hycly4yltuk6c4vsx45kn003amet69eqp2sxcjnw46udaczqf3n354"} 92749412
penumbra_stake_missed_blocks{identity_key="penumbravalid1ajryejwx349xl9lvh8sym09jle9chaum4628k7c0n5rpaw9v2cgqrtn9z4"} 85621965
penumbra_stake_missed_blocks{identity_key="penumbravalid1uzn79d9s736t6nga7060ts9ghncuzhahr92nldamwhjhc726458qdf509n"} 57060921
penumbra_stake_missed_blocks{identity_key="penumbravalid1lndq7e036xvuu7s7kavf9ncnqq70czyyj8gdenyzn8k9cyuwvg9qqm33vm"} 48428299
penumbra_stake_missed_blocks{identity_key="penumbravalid1wpkhqt7p3t84ux3vxdtscp8t8rz89xrarfc0ey0z79tux0na0y8qz72s8r"} 47233969
penumbra_stake_missed_blocks{identity_key="penumbravalid1zjdq22kr0n0e7u3z363h9ussktu768nqpryr47y6w7rxylx75gpqyfkqw6"} 45140250
penumbra_stake_missed_blocks{identity_key="penumbravalid1nnl5c8ad9gm0kpc42ttmyw29vrrsjaurk8euuj73kj9hvape4vqsm8c2f6"} 44854894
penumbra_stake_missed_blocks{identity_key="penumbravalid182uscvf6qaqq3r7tdy3uzf3thrrym08gnzg3furvdgvfguvgn5qs97p3tm"} 44854894
penumbra_stake_missed_blocks{identity_key="penumbravalid1227gvj8qmfp9pwnxwue3smduhjjk6junn9pgu9p5lru52x6a4spq3m9t4d"} 42277330
penumbra_stake_missed_blocks{identity_key="penumbravalid1mzfd4lk4kpx5gtvwas6e26mhkzh986g7clqwmucxe64md8kuty9qs0pf9c"} 25093054
penumbra_stake_missed_blocks{identity_key="penumbravalid16jcq73cy93hx6dc857drpednfmpe032uxrvzjg00xc0ser54xq8s6j6jjp"} 10619136
penumbra_stake_missed_blocks{identity_key="penumbravalid18sgu2nrded90ntn747223ttv4ggna37t8d8wlfwj42qu487lw58q3adpln"} 10504890
penumbra_stake_missed_blocks{identity_key="penumbravalid15tjp4mww35qmq8fvpkhjsd87ndnj9pthwzf33sn0n6p6vncrvyrqlf7275"} 9501019

Those values are far too high: the total block height is only ~60k at time of writing. At a guess, we're incrementing the missed_blocks gauge by the entirety of historical missed blocks for that validator:

https://github.com/penumbra-zone/penumbra/blob/6ab217514f49c03eafbca1399fe57e19e6a96925/crates/core/component/stake/src/component/validator_handler/validator_manager.rs#L165-L166

That .increment() call should instead be .set() on the gauge.
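Concretely, the fix might look like this (a sketch assuming the metrics crate's labeled-gauge API; the function and argument names are stand-ins for what the linked validator_manager.rs has in scope):

```rust
use metrics::gauge;

// Sketch of the increment -> set fix. `identity_key` and
// `num_missed_blocks` stand in for values at the linked call site.
fn record_missed_blocks(identity_key: &str, num_missed_blocks: u64) {
    // Before: .increment(...) re-added the full historical total on
    // every update, compounding the gauge far past the chain height.
    // After: .set(...) overwrites the gauge with the current total.
    gauge!(
        "penumbra_stake_missed_blocks",
        "identity_key" => identity_key.to_string(),
    )
    .set(num_missed_blocks as f64);
}
```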

conorsch avatar Feb 16 '24 18:02 conorsch

I'm surprised increment(0) and set(0) are any different

edit: ah nvm. got it.

erwanor avatar Feb 16 '24 20:02 erwanor

Ok I get why they're different, but not why those values are so high

erwanor avatar Feb 16 '24 20:02 erwanor

> Ok I get why they're different, but not why those values are so high

We should be computing missed_blocks += n but instead we're computing missed_blocks += (missed_blocks + n), is my understanding.
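A toy calculation shows how that compounding blows up (plain Rust, with the loop count chosen to match the magnitudes above):

```rust
// Toy demonstration: incrementing a gauge by the running total
// instead of the delta makes it grow quadratically.
fn main() {
    let (mut total_missed, mut buggy_gauge) = (0u64, 0u64);
    for _ in 0..17_000 {
        total_missed += 1;           // one more missed block
        buggy_gauge += total_missed; // gauge += running total (the bug)
    }
    // buggy_gauge is ~total_missed^2 / 2, i.e. ~144.5M: the same order
    // of magnitude as the values above despite a chain height of ~60k.
    // .set(total_missed) would instead record 17,000.
    println!("buggy={buggy_gauge} correct={total_missed}");
}
```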

conorsch avatar Feb 20 '24 21:02 conorsch

Okay, wow - I would think that increment(N) does counter += N and set(N) does counter = N but I guess not, good catch

erwanor avatar Feb 20 '24 21:02 erwanor

x-ref #3995, i stumbled across this today. i'm going to self-assign this, because i believe i can fix it as part of #4070 / #4040.

cratelyn avatar Mar 22 '24 15:03 cratelyn

Took a closer look at this today. The missed_blocks issue appears quite simple: the increment -> set fix discussed above should resolve it.

As for the validator state metrics, the current setup can't possibly work: we're incrementing and decrementing gauges based on validator state transitions, but gauges start at zero by default, so we're actually emitting metrics on the validator state transitions observed by a specific instance of pd over the course of its uptime. Bouncing the pd service wipes those metrics, and they drift again over time. That's why we're seeing negative numbers: a long-lived pd instance starts with its count of jailed validators at 0, and whenever a validator that was jailed before the process started leaves the jailed state, the gauge is decremented below zero.

To resolve, we must ground the metrics in actual historical state. Two options:

  1. on pd start, query the stake service for current values of active, defined, jailed, etc, and initialize the gauges to those values; then leave the existing increment/decrement logic in place, and trust those values to be updated properly over time; or
  2. modify the existing stake metrics function to perform the stake query, and set the gauges from those values, so it's always correct (see the sketch after this list).
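A minimal sketch of option 2, assuming hypothetical state types (the real query would go through the stake component's state-access machinery rather than a slice of states):

```rust
use metrics::gauge;

// Hypothetical stand-in for the stake component's validator state enum.
#[derive(Clone, Copy, PartialEq)]
enum ValidatorState { Defined, Inactive, Active, Jailed, Disabled }

/// Recompute every gauge from the full validator set on each update,
/// so the metrics are grounded in chain state rather than in the
/// transitions observed since the last pd restart.
fn set_validator_state_gauges(states: &[ValidatorState]) {
    let count = |s| states.iter().filter(|&&v| v == s).count() as f64;
    gauge!("penumbra_stake_validators_defined").set(count(ValidatorState::Defined));
    gauge!("penumbra_stake_validators_inactive").set(count(ValidatorState::Inactive));
    gauge!("penumbra_stake_validators_active").set(count(ValidatorState::Active));
    gauge!("penumbra_stake_validators_jailed").set(count(ValidatorState::Jailed));
    gauge!("penumbra_stake_validators_disabled").set(count(ValidatorState::Disabled));
}
```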

Here's a lightly edited terminal session illustrating the problem:

❯ # shortly after joining, the metrics are nil:
❯ curl -s http://localhost:9000/metrics | grep ^penumbra_stake | grep -v missed_blocks | sort
❯ # a bit later, as an epoch was crossed during syncing up to current height
❯ curl -s http://localhost:9000/metrics | grep ^penumbra_stake | grep -v missed_blocks | sort
penumbra_stake_validators_active -1
penumbra_stake_validators_jailed 1
❯ # evidence that we're still catching up:
❯ curl -s http://localhost:26657/status | jq -r .result.sync_info.catching_up
true
❯ # bounce the service, see relevant metrics reset:
❯ sudo systemctl restart penumbra cometbft
[sudo] password for conor:
❯ curl -s http://localhost:9000/metrics | grep ^penumbra_stake | grep -v missed_blocks | sort

Once fixed, we should tack on an integration test to make sure these metrics exist at network start: active validators should never be zero at time of join.
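That test could be as simple as scraping the metrics endpoint from the test harness; a sketch, assuming reqwest/tokio and a pd exposing metrics on localhost:9000 as in the sessions above:

```rust
// Sketch of the suggested integration test; endpoint and gauge name
// taken from the terminal sessions in this thread.
#[tokio::test]
async fn active_validators_gauge_nonzero_at_join() -> anyhow::Result<()> {
    let body = reqwest::get("http://localhost:9000/metrics")
        .await?
        .text()
        .await?;
    let active: f64 = body
        .lines()
        .find(|l| l.starts_with("penumbra_stake_validators_active"))
        .and_then(|l| l.split_whitespace().last())
        .ok_or_else(|| anyhow::anyhow!("gauge not exported"))?
        .parse()?;
    assert!(active > 0.0, "active validators should never be zero at join");
    Ok(())
}
```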

conorsch avatar Apr 10 '24 23:04 conorsch

Should we be using metrics for this at all? I think the answer is probably no. I don't think we should be using metrics for anything that requires reconciling ongoing events with historical data. We already have a system for that, the event mechanism, and we should use it. And there is a standard pattern for rendering events on blockchains, namely a block explorer.

If someone wants to check if their validator is online or its status, they should open the validator's page in a block explorer. But we shouldn't try to build a half-featured block explorer inside Grafana just because our existing block explorer isn't functional.
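For a sense of what the event-based approach might look like, here's a hedged sketch using tendermint-rs ABCI event types (the event name and attribute keys are hypothetical, not an actual schema):

```rust
use tendermint::abci::{Event, EventAttributeIndexExt};

// Sketch of the event-based alternative: emit a structured ABCI event
// on each validator state transition, so an indexer or block explorer
// can reconstruct history without scraping process-local gauges.
fn validator_state_change_event(identity_key: &str, old: &str, new: &str) -> Event {
    Event::new(
        "penumbra_validator_state_change",
        [
            ("identity_key", identity_key).index(),
            ("old_state", old).index(),
            ("new_state", new).index(),
        ],
    )
}
```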

hdevalence avatar Apr 11 '24 15:04 hdevalence

> Should we be using metrics for this at all? [...] We already have a system for that, the event mechanism, and we should use it.

This is a great take, and we have progress toward the event pattern in #4277.

conorsch avatar May 06 '24 19:05 conorsch

Closing in favor of https://github.com/penumbra-zone/penumbra/issues/4336, which reframes the ask as event emission.

conorsch avatar May 06 '24 23:05 conorsch