cardano-node
cardano-node copied to clipboard
[FR] - Log reason when node switches tip to a fork
Internal/External External: Stake pool operator
Area Other: Logging
Describe the feature you'd like Please log the reason why the node preferred another fork
Describe alternatives you've considered
The independent tool cncli provides logging of block data including the block VRF value into a sqlite database. This database can be queried for insight into whether the likely reason the node preferred another fork was because the other fork terminal block had a lower block VRF and the tie break rule was applied.
Additional context / screenshots The node knows the reason for switching tip to a competing fork but this reason isn't logged. Rather the logs only confirm that the tip was switched. But, was the reason due to the competing fork being longer? Or was it due to the other fork's terminal block having a lower VRF causing the tie-break rule to be applied? Or did the node prefer another fork some other reason?
This information would be really helpful when trying to identify what caused your own block to not be adopted.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.
Please stay as an open feature request.
Yes, please: if this is considered impractical to add I would like to hear why. Our SPO operation believes this is important for quality assurance purposes.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.
@carbolymer what kind of engagement do the developers expect for this issue not to be Stale? There won't be anything new to report until someone in the relevant team finally responds.
This log info has been long & vitally requested for diagnostic purposes and the need for it is never going to go away. Those who need it most would be small stake pools with infrequent blocks... and therefore needed more in the near future if & when the K PoS parameter increases (as is constantly suggested in governance conversations).
If this issue keeps being dismissed, the implication is that QA isn't very important for stake pools and/or chain density isn't that important for Cardano itself. I don't think either of these things are true and I can state from SPO experience dating back to the beginning of the Shelley period that implementing this feature (which we arguably should have had from day 1) would be helpful for SPOs and node infrastructure operators at all levels of endeavour.
There was a slack discussion about this issue some time ago. @jasagredo @amesgen would you mind sharing main takeaways here?
The trace messages already contain enough detail to deduce the reason why a node changes its chain. Concretely, whenever we change our selection, we trace the so called "SelectView" of the tip of both the old and the new chain. This SelectView contains exactly the information that is needed in order to determine whether a chain is preferable to another chain. See PraosChainSelectView for more details and assorted documentation.
In Conway, the chain order works like this, via a lexicographical combinationi of the following criteria, when deciding whether to switch from the current chain to a candidate chain:
- compare chain length, preferring larger values
- when (same slot and same issuer) compare operational certificate issue number, preferring larger values
- when (slots differ by at most 5) compare VRF, preferring smaller values
By comparing the old and new SelectView, one can see which of these three rules was responsible for the chain switch:
Let's look at an example:
{
"at": "2024-12-13T13:50:58.295305594Z",
"ns": "ChainDB.AddBlockEvent.AddedToCurrentChain",
"data": {
"kind": "AddedToCurrentChain",
"newTipSelectView": {
"chainLength": 11214658,
"issueNo": 16,
"issuerHash": "8e0bb9b126acd8e65a023602377d9a7a6f15af394d1f7bdd852182b8",
"kind": "PraosChainSelectView",
"slotNo": 142531567,
"tieBreakVRF": "f8f31aabb555124392485cbd7ba02bded4e4eac8980fc36fde0c9a21d7cf2d51c6fd4ddda8113153639db862781dafacbdd77da4e072452a330c32509506762e"
},
"newtip": "17d5b0f21e86503b41851640e9558610ebe9b7a62c96a08873eb4d0c7961a460@142531567",
"oldTipSelectView": {
"chainLength": 11214657,
"issueNo": 20,
"issuerHash": "c63dab6d780a74cbae2a27696c9723f55b3092b2bd001256df03827f",
"kind": "PraosChainSelectView",
"slotNo": 142531554,
"tieBreakVRF": "905ff37db665e86d60d93dabc92ec86124b1a22165e4968facca226e790d9b7dd3f60502fa84dcd30d95ac28a9278e5e72b8b2eb451759ab59c92fc65909a2b4"
}
},
"sev": "Notice",
"thread": "76",
"host": "ramify"
}
Here, we can see that the chainLength of newTipSelectView is larger than the one of the chainLength of the oldTipSelectView, so the chain switch happened due to rule 1.
If these would be equal, one would have to look at rule 2 next (and then rule 3) to see whether it is responsible.
For convenience/simplicity, we could further enrich the trace message to directly include eg a reason field that says something like "LongerChain" or "BetterTieBreaker".
cc @mgmeier as you might have already started with this.
That sort of log information would absolutely suit my needs. I wasn't aware we could get the "tieBreakVRF" values logged ordinarily.
What do I need to configure in my node config.json file so that such information is logged in my systemd journal?
In both the old and new tracing system, you should see this when you enable JSON output and use a severity of at least Notice for ChainDB.AddBlockEvent.AddedToCurrentChain/ChainDB.AddBlockEvent.SwitchedToAFork.
Thanks @amesgen for summarizing the discussion we had (BTW: needed to edit your post to change my GH handle to @mgmeier - the correct one). I'll add some more
- As mentioned the above trace messages / tracers need to be enabled, and severity filtering should be configured to include messages of serverity
Notice. - ~In the new system, you'd additionally need to configure the trace messages to be more verbose, by setting their detail level to at least
DDetailed.~ - We have it on our list to add a convenience field to the trace message, stating a reason for switching to a fork (should that happen). Would this make sense to provide at normal detail level, without knowing the underlying detailed criteria (like chain length, opcert no and VRF)?
- Point 3 will likely happen for the new tracing system only, as the old system will be maintenance-only starting with Node release 10.2 - and not receive additional features.
BTW: needed to edit your post to change my GH handle to @mgmeier - the correct one
Thanks for catching that!
- In the new system, you'd additionally need to configure the trace messages to be more verbose, by setting their detail level to at least
DDetailed.
In the code, it looks like {new,old}TipSelectView is traced even with the default level of detail:
https://github.com/IntersectMBO/cardano-node/blob/1774c93f9174022f359fab3bfe81d9427e32634a/cardano-node/src/Cardano/Node/Tracing/Tracers/ChainDB.hs#L511-L518
In the code, it looks like {new,old}TipSelectView is traced even with the default level of detail
Aplogies, yes, the info will be traced at default detail level. I was looking at a local branch where I was toying around with the settings :-|
In both the old and new tracing system, you should see this when you enable JSON output
@amesgen @mgmeier our pool has 4+ years of homebrewed automation driven from text-based output ("ViewMode": "SimpleView").
Does it follow from https://github.com/IntersectMBO/cardano-node/issues/6014#issuecomment-2574907814 that AddedToCurrentChain and SwitchedToAFork are not visible from SimpleView, and that therefore we would have to change to JSON logging to see these events?
Which tracing system are you using (i.e., what is your config value for UseTraceDispatcher)?
Whatever the default would be for the node: we don't have this setting in any of our config files.
Then it is the old system, as currently it defaults to false.
As far as I can see, the info for {new,old}TipSelectView is indeed only available for JSON logging.
However, here's an idea about how to scrape that info without breaking existing automation. In the node config, you could set up an additional scribe to a file, having "scFormat": "ScJson". You would then need to create a script, possibly based on grep and jq, that monitors the trace messages (=JSON objects) going to that file. When you encounter a message indicating switching to a fork, the script could inspect its data and send an event to the systemd journal, to merge with the rest of the logging data in text-based format. If you configure log rotation or somesuch for that file scribe, the script would need to take that into account somehow as well.
Would that work for you somehow?
yes, thanks @mgmeier - to log these events we'll fork off another log as you suggest. 🙏 If whatever changes are contemplated here produce effects in SimpleView I trust we will be able to find out about them here or in a linked thread.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.
@TerminadaPool - as the OP, how do you feel about requiring a 2nd JSON format log output to make these events visible? And is this issue resolved as you see it? (the marking Stale as a foot-in-the-door to closing this FR doesn't seem like a proper resolution to me: seems like it should be either closed "Won't Fix" or given a label that keeps it open until the feature is provided).
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.
@github-actions I am running out of things to say here - cc @mgmeier @amesgen @TerminadaPool
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.
Hello @github-actions!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.
@github-actions @mgmeier I do think this issue is still relevant and still hope that it can be catalogued in such a way that we don't have to confirm this every 30 days to avoid the issue becoming "stale" and disappearing.
I don't actually separately configure an extra .json file output for logging but instead use the normal journal output. I would appreciate a normal log message stating the reason for why the node switched tip.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.
🎶
In node version 10.4.1 I have configured MachineFormat for Stdout backend in TraceOptions:
"TraceOptions": {
"": {
"backends": [
"Stdout MachineFormat",
MachineFormat provides information like this for each block adopted:
Jun 18 18:50:40 relay2 cardano-node[2570632]: {"at":"2025-06-18T08:50:40.790719758Z","ns":"ChainDB.AddBlockEvent.AddedToCurrentChain","data":{"kind":"AddedToCurrentChain","newTipSelectView":"chainLength":12012087,"issueNo":14,"issuerHash":"8d2fef46df5f9b2fcb64ebad739f27062b4f38c04df1a916fa216752","kind":"PraosChainSelectView","slotNo":158670349,"tieBreakVRF":"42e07a46dc4aeccb949caaef8c3341392ae0744a361fd854123d314f37723702950c80615201b90b18276239375a1f02ebe189e1243260347d9cc039941804bb"},"newtip":"6b01316c4536981cb767ff63521aa3d857a83f271e040a89d85471ea5fbc097f@158670349","oldTipSelectView":"chainLength":12012086,"issueNo":26,"issuerHash":"e985b5698de1386439e62b5b7ab2cc9f814b91b4a1fef5970f38694d","kind":"PraosChainSelectView","slotNo":158670327,"tieBreakVRF":"f5d16073b6bad50b1517830e70ef7a252ceada181cd7b4e7f6f35908487e7be14f7859a7f7c2d8997c29e9b23a9a844dd69dd3c03dbc72491877ea550cbd0247"}},"sev":"Notice","thread":"79","host":"relay2"}
It is a bit of a pain to read this json format in the journal, but you can see the tieBreakVRF value. When there is a "switched tip to new fork" event you can see that the node followed the fork with the lower tieBreakVRF.
Unfortunately you don't get the tieBreakVRF values in the logs if you use the human readable formats: HumanFormatColoured or HumanFormatUncoloured. But at least the spelling of "coloured" is correct. :)
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.