analysis failure: `LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069`
processing ../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/logs-node-9.flt.json
{ "tip": "a1fad4dfb8fc82410026f8edd7008af0c1e32b50837689e2dd378c3a14d0f65d" }
{ "chain": "../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/chain-raw.txt" }
{ "filtered-chain-slot-domain": {"ddFilteredCount":39953,"ddFilteredFirst":16026,"ddFilteredLast":55979,"ddRawCount":65619,"ddRawFirst":865,"ddRawLast":66484} }
{ "filtered-chain-block-domain": {"ddFilteredCount":1815,"ddFilteredFirst":739,"ddFilteredLast":2554,"ddRawCount":3074,"ddRawFirst":0,"ddRawLast":3073} }
{ "chain": "../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/chain.txt" }
locli: LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069
CallStack (from HasCallStack):
error, called at src/Cardano/Analysis/MachPerf.hs:343:10 in locli-1.29-inplace:Cardano.Analysis.MachPerf
real 9m8.705s
run 2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs on AWS
- point of error: https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L324
~The logic of error here is that we're seeing a LOTraceLeadershipDecided before LOTraceStartLeadershipCheck -- which indeed is weird, if true and if locli didn't trip over itself in some way.~
I've misread the basic message, see down below.
Whether it's true or not, can be determined from the source of truth -- the logs from the particular node -- however the error message is unhepfully not mentioning the particular node -- so we'll need to fix the error reporting here.
Ok, so we've ran the new locli on an old run, which we're 100% sure has all log messages proper -- and we've got the same failure.
So it is a regression in legacy trace handling.
Ok, giving --dump-logobjects to wb a will give us insight on how locli views the stream of logobjects:
wb --cls a --filters size-small --dump-logobjects std 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null
Ok, so we know it's the performance-analysis portion of the analysis that fails (aka MachPerf in haskell speak).
Next thing -- iterating over all the gigabytes of logs is slow, and we know the host, whose logs trigger this, so we now have a new subcommand -- performance-single-host (or perf-single for short):
wb --cls a --filters size-small perf-single node-0 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null
What does the error message say?
LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069
What does it mean, on the most basic level?
It means, that we've encountered a LOTraceLeadershipDecided with a slot number that is behind the analysis tip (cur)).
Hmm, so the handler for LOTraceLeadershipDecided expects to be operating on tip always.
Is that a reasonable assumption?
No, it's not -- there can be things that legitimately progress analysis -- and break this assumption:
- https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L247
- https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L257
Of this, the second one is not the culprit -- LOTraceStartLeadershipCheck always happens before LOTraceLeadershipDecided for any given slot.
Hence, it must be the first one, for LOResources -- and those are indeed asynchronous -- there is no ordering wrt any other messages.
Ok, this fiasco with node-51 had another consequence -- crazy stats:
$ jq -C '.sResourceCDFs | .[] | select(objects) | select(.cdfAverage > 100000) | { cdfAverage: .cdfAverage, loc: input_filename }' logs-node-*.flt.perf-stats.json
{
"cdfAverage": 261244935.45651916,
"loc": "logs-node-51.flt.perf-stats.json"
}
{
"cdfAverage": 3878145198.9159274,
"loc": "logs-node-51.flt.perf-stats.json"
}
{
"cdfAverage": 3878145168.1841264,
"loc": "logs-node-51.flt.perf-stats.json"
}
{
"cdfAverage": 3878145153.52692,
"loc": "logs-node-51.flt.perf-stats.json"
}
{
"cdfAverage": 273934423429010.34,
"loc": "logs-node-51.flt.perf-stats.json"
}
{
"cdfAverage": 273934423429033.2,
"loc": "logs-node-51.flt.perf-stats.json"
}
``
The easy solution is to just exclude node-51 from machine performance analysis, only keeping its BlockProp contribution -- without which we can't reconstruct the blockchain.
This would need some analyse.sh extension -- since the same data files are fed into the same processing anaconda..
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.