cardano-node analysis failure: `LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069`

processing ../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/logs-node-9.flt.json
{ "tip": "a1fad4dfb8fc82410026f8edd7008af0c1e32b50837689e2dd378c3a14d0f65d" }
{ "chain": "../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/chain-raw.txt" }
{ "filtered-chain-slot-domain": {"ddFilteredCount":39953,"ddFilteredFirst":16026,"ddFilteredLast":55979,"ddRawCount":65619,"ddRawFirst":865,"ddRawLast":66484} }
{ "filtered-chain-block-domain": {"ddFilteredCount":1815,"ddFilteredFirst":739,"ddFilteredLast":2554,"ddRawCount":3074,"ddRawFirst":0,"ddRawLast":3073} }
{ "chain": "../bench-1/runs/2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs/analysis/chain.txt" }

locli: LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069
CallStack (from HasCallStack):
  error, called at src/Cardano/Analysis/MachPerf.hs:343:10 in locli-1.29-inplace:Cardano.Analysis.MachPerf
real    9m8.705s

Jun 02 '22 16:06 deepfire

run 2022-06-01-18.53.1.35.0.9823.k51-5ep-360kTx-4000kU-1000kD-64kbs on AWS

Jun 02 '22 16:06 deepfire

point of error: https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L324

Jun 02 '22 16:06 deepfire

~The logic of error here is that we're seeing a LOTraceLeadershipDecided before LOTraceStartLeadershipCheck -- which indeed is weird, if true and if locli didn't trip over itself in some way.~

I've misread the basic message, see down below.

Jun 02 '22 16:06 deepfire

Whether it's true or not, can be determined from the source of truth -- the logs from the particular node -- however the error message is unhepfully not mentioning the particular node -- so we'll need to fix the error reporting here.

Jun 02 '22 17:06 deepfire

Ok, so we've ran the new locli on an old run, which we're 100% sure has all log messages proper -- and we've got the same failure.

So it is a regression in legacy trace handling.

Jun 02 '22 17:06 deepfire

Ok, giving --dump-logobjects to wb a will give us insight on how locli views the stream of logobjects:

wb --cls a --filters size-small --dump-logobjects std 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null

Jun 02 '22 17:06 deepfire

Ok, so we know it's the performance-analysis portion of the analysis that fails (aka MachPerf in haskell speak).

Next thing -- iterating over all the gigabytes of logs is slow, and we know the host, whose logs trigger this, so we now have a new subcommand -- performance-single-host (or perf-single for short):

wb --cls a --filters size-small perf-single node-0 2022-05-09-15.24.1.34.1.36a3.k51-7ep-14kTx-4000kU-1000kD-73kbs-12MUTx-10BStTx-50MUBk-40BStBk-1i-1o--null

Jun 02 '22 20:06 deepfire

What does the error message say?

LeadDecided for noncurrent slot=SlotNo 4068 cur=SlotNo 4069

What does it mean, on the most basic level?

It means, that we've encountered a LOTraceLeadershipDecided with a slot number that is behind the analysis tip (cur)).

Hmm, so the handler for LOTraceLeadershipDecided expects to be operating on tip always. Is that a reasonable assumption?

No, it's not -- there can be things that legitimately progress analysis -- and break this assumption:

https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L247
https://github.com/input-output-hk/cardano-node/blob/master/bench/locli/src/Cardano/Analysis/MachTimeline.hs#L257

Of this, the second one is not the culprit -- LOTraceStartLeadershipCheck always happens before LOTraceLeadershipDecided for any given slot.

Hence, it must be the first one, for LOResources -- and those are indeed asynchronous -- there is no ordering wrt any other messages.

Jun 02 '22 22:06 deepfire

Ok, this fiasco with node-51 had another consequence -- crazy stats:

$ jq -C '.sResourceCDFs | .[] | select(objects) | select(.cdfAverage > 100000) | { cdfAverage: .cdfAverage, loc: input_filename }' logs-node-*.flt.perf-stats.json
{
  "cdfAverage": 261244935.45651916,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145198.9159274,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145168.1841264,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 3878145153.52692,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 273934423429010.34,
  "loc": "logs-node-51.flt.perf-stats.json"
}
{
  "cdfAverage": 273934423429033.2,
  "loc": "logs-node-51.flt.perf-stats.json"
}
``

Jun 24 '22 15:06 deepfire

The easy solution is to just exclude node-51 from machine performance analysis, only keeping its BlockProp contribution -- without which we can't reconstruct the blockchain.

This would need some analyse.sh extension -- since the same data files are fed into the same processing anaconda..

Jun 24 '22 15:06 deepfire

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

Oct 11 '25 01:10 github-actions[bot]