nimbus-eth2 attestation pool timing

Do not merge.

Jul 30 '22 15:07 tersec

Unit Test Results

      12 files     860 suites 1h 20m 31s :stopwatch:   1 911 tests   1 764 :heavy_check_mark: 147 :zzz: 0 :x: 10 341 runs 10 151 :heavy_check_mark: 190 :zzz: 0 :x:

Results for commit d69186b1.

:recycle: This comment has been updated with latest results.

Jul 30 '22 18:07 github-actions[bot]

Wouldn't some metrics be better instead?

Jul 30 '22 21:07 mratsim

For capturing timing, not necessarily. Logs come with timestamps, and any pair of logs can therefore be compared. They exist in a global/shared timeline, unlike metrics. It's also more straightforward/freeform to add arbitrary new logging for this purpose (the per-loop iterations, for example).

Of course, metrics have ways have handling all of these things, but by their nature, they aggregate, and are normally viewed through the refracting lenses of Prometheus/Grafana, whereas for this diagnostic purpose, I'm looking for the most direct view of what Nimbus is doing as possible, in a way that's possible for the person running this to copy and paste back.

This is (hopefully the three distinct cues not to merge was a small hint) not a permanent PR, but specifically for a branch to investigate a reported timing issue. This is not something meant for people's dashboards for months/years to come.

Jul 30 '22 22:07 tersec

This is (hopefully the three distinct cues not to merge was a small hint) not a permanent PR, but specifically for a branch to investigate a reported timing issue. This is not something meant for people's dashboards for months/years to come.

Actually that is the question. Is this a function important or tricky enough that we want to monitor it for performance regressions?

Jul 31 '22 09:07 mratsim

This turned out not to be the problem, so closing. If people think that it's worth doing some of the logging discussed here, that can be raised in a separate issue.

Aug 16 '22 05:08 tersec

If people think that it's worth doing some of the logging discussed here, that can be raised in a separate issue.

I think the idea of info-logging operations that take longer than say .. 250ms has validity - it has certainly helped us in the past to put a spotlight on issues, and in there, I think it's a good idea to split up the various durations that go into producing a block (ie packing attestations, waiting for randao, waiting for eth1 and finally waiting for signature..

Aug 16 '22 11:08 arnetheduck

https://github.com/status-im/nimbus-eth2/tree/measure-block-proposal-timing does this in a way that, while conceivably doable in stable/unstable, probably would need refining. One aspect is that it intentionally doesn't try to time durations, but log and just exploit that logging includes timestamps, which affords it a flexibility (n checkpoints allows O(n^2) possible durations) but also a rawness that might be less appropriate in a more focused way.

There'd always be a tension for adding these to INFO in a permanent way, but for most people block proposals are rare enough that it's probably not a major log spam risk to add O(1) log messages per block, even if that's several.

Aug 24 '22 04:08 tersec

doesn't try to time durations

I've on purpose called all durations xxxDur so that a grep Dur=[0-9]+s gives anything that takes a second or more :) it's possible to build a line differ that looks at timestamps, but it can't tell a "normal" gap between to log lines from an "interesting" one without further filtering, making it too noisy for general-purpose scanning

Aug 24 '22 06:08 arnetheduck

nimbus-eth2 nimbus-eth2 copied to clipboard

attestation pool timing

Unit Test Results

nimbus-eth2
nimbus-eth2 copied to clipboard