PVF Validation checks for slashing

Open eskimor opened this issue 3 years ago • 3 comments

  • [ ] Is execution timeout actually taken into account in all the right places?
  • [x] Do we have metrics on validation execution time?
  • [ ] Double check that this metric is measuring the right thing (the time that, if exceeded, will cause validation to fail).
  • [ ] Monitor that metric on our validators over a long period of time (weeks) and see how much it fluctuates on a single validator and across our validators. It is important not to average the values out: we are interested in maximums here.
  • [ ] Check that we have precise logging on the actual cause of a validation error.
  • [ ] Check that the time a validation took is logged.
  • [ ] Examine those logs (might reveal something that gets lost in metrics due to averaging)
  • [ ] Get those logs from validators that are being slashed.
  • [ ] Add an alert if validation time gets anywhere close to the timeout in approval checking.
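
The checklist's emphasis on maximums rather than averages, and on alerting when validation time approaches the timeout, can be sketched as follows. This is only an illustration, not code from the Polkadot repository: `ExecutionSummary`, `summarize`, and `near_timeout` are hypothetical names, and the timeout and alert fraction are parameters the caller has to supply, since the thread does not state the actual values.

```rust
use std::time::Duration;

/// Summary of validation execution times observed on one validator.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq)]
pub struct ExecutionSummary {
    pub mean: Duration,
    pub max: Duration,
}

/// Returns the mean and maximum of `samples`, or `None` if there are no samples.
pub fn summarize(samples: &[Duration]) -> Option<ExecutionSummary> {
    // `max()` yields `None` for an empty slice, which `?` propagates.
    let max = *samples.iter().max()?;
    let total: Duration = samples.iter().sum();
    Some(ExecutionSummary {
        mean: total / samples.len() as u32,
        max,
    })
}

/// True if the slowest observed validation used at least `fraction`
/// (for example 0.8) of `timeout`. Both values are assumptions chosen
/// by the caller, not constants taken from the node.
pub fn near_timeout(summary: &ExecutionSummary, timeout: Duration, fraction: f64) -> bool {
    summary.max.as_secs_f64() >= timeout.as_secs_f64() * fraction
}
```

For samples of 0.5s, 0.6s, 0.7s and 5s, the mean is 1.7s while the maximum is 5s. An alert keyed on the mean would stay silent against an assumed 6s timeout with a 0.8 fraction, whereas one keyed on the maximum would fire, which is why the checklist asks not to average the values out.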

eskimor avatar Sep 01 '22 08:09 eskimor

What logs should validators have enabled, is WARN/INFO enough to get useful info?

eskimor avatar Sep 01 '22 14:09 eskimor

Parity's Kusama validators rarely go above 2s of validation time. Within the last two weeks this happened only once, and even then it stayed below 3s:

https://grafana.parity-mgmt.parity.io/goto/F3xTfeZ4k?orgId=1

Screenshot from 2022-09-01 17-58-07

Thanks @ordian!

eskimor avatar Sep 01 '22 16:09 eskimor

What logs should validators have enabled, is WARN/INFO enough to get useful info?

parachain::candidate-validation=debug would be useful.

ordian avatar Sep 01 '22 16:09 ordian

It might be a bit early to have definite results, but it looks like doubling the approval voting timeout did not entirely fix the problem: Screenshot from 2022-09-26 18-22-53

Which is expected, as we know of at least two other reasons for disputes by now:

#6041 #6057

eskimor avatar Sep 26 '22 17:09 eskimor