cardano-node icon indicating copy to clipboard operation
cardano-node copied to clipboard

[FR] - Log reason when node switches tip to a fork

Open TerminadaPool opened this issue 1 year ago • 35 comments

Internal/External External: Stake pool operator

Area Other: Logging

Describe the feature you'd like Please log the reason why the node preferred another fork

Describe alternatives you've considered The independent tool cncli provides logging of block data including the block VRF value into a sqlite database. This database can be queried for insight into whether the likely reason the node preferred another fork was because the other fork terminal block had a lower block VRF and the tie break rule was applied.

Additional context / screenshots The node knows the reason for switching tip to a competing fork but this reason isn't logged. Rather the logs only confirm that the tip was switched. But, was the reason due to the competing fork being longer? Or was it due to the other fork's terminal block having a lower VRF causing the tie-break rule to be applied? Or did the node prefer another fork some other reason?

This information would be really helpful when trying to identify what caused your own block to not be adopted.

TerminadaPool avatar Oct 15 '24 07:10 TerminadaPool

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Nov 15 '24 02:11 github-actions[bot]

Please stay as an open feature request.

TerminadaPool avatar Nov 15 '24 03:11 TerminadaPool

Yes, please: if this is considered impractical to add I would like to hear why. Our SPO operation believes this is important for quality assurance purposes.

rphair avatar Nov 15 '24 11:11 rphair

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Dec 16 '24 02:12 github-actions[bot]

@carbolymer what kind of engagement do the developers expect for this issue not to be Stale? There won't be anything new to report until someone in the relevant team finally responds.

This log info has been long & vitally requested for diagnostic purposes and the need for it is never going to go away. Those who need it most would be small stake pools with infrequent blocks... and therefore needed more in the near future if & when the K PoS parameter increases (as is constantly suggested in governance conversations).

If this issue keeps being dismissed, the implication is that QA isn't very important for stake pools and/or chain density isn't that important for Cardano itself. I don't think either of these things are true and I can state from SPO experience dating back to the beginning of the Shelley period that implementing this feature (which we arguably should have had from day 1) would be helpful for SPOs and node infrastructure operators at all levels of endeavour.

rphair avatar Dec 16 '24 11:12 rphair

There was a slack discussion about this issue some time ago. @jasagredo @amesgen would you mind sharing main takeaways here?

carbolymer avatar Dec 27 '24 11:12 carbolymer

The trace messages already contain enough detail to deduce the reason why a node changes its chain. Concretely, whenever we change our selection, we trace the so called "SelectView" of the tip of both the old and the new chain. This SelectView contains exactly the information that is needed in order to determine whether a chain is preferable to another chain. See PraosChainSelectView for more details and assorted documentation.

In Conway, the chain order works like this, via a lexicographical combinationi of the following criteria, when deciding whether to switch from the current chain to a candidate chain:

  1. compare chain length, preferring larger values
  2. when (same slot and same issuer) compare operational certificate issue number, preferring larger values
  3. when (slots differ by at most 5) compare VRF, preferring smaller values

By comparing the old and new SelectView, one can see which of these three rules was responsible for the chain switch:


Let's look at an example:

{
  "at": "2024-12-13T13:50:58.295305594Z",
  "ns": "ChainDB.AddBlockEvent.AddedToCurrentChain",
  "data": {
    "kind": "AddedToCurrentChain",
    "newTipSelectView": {
      "chainLength": 11214658,
      "issueNo": 16,
      "issuerHash": "8e0bb9b126acd8e65a023602377d9a7a6f15af394d1f7bdd852182b8",
      "kind": "PraosChainSelectView",
      "slotNo": 142531567,
      "tieBreakVRF": "f8f31aabb555124392485cbd7ba02bded4e4eac8980fc36fde0c9a21d7cf2d51c6fd4ddda8113153639db862781dafacbdd77da4e072452a330c32509506762e"
    },
    "newtip": "17d5b0f21e86503b41851640e9558610ebe9b7a62c96a08873eb4d0c7961a460@142531567",
    "oldTipSelectView": {
      "chainLength": 11214657,
      "issueNo": 20,
      "issuerHash": "c63dab6d780a74cbae2a27696c9723f55b3092b2bd001256df03827f",
      "kind": "PraosChainSelectView",
      "slotNo": 142531554,
      "tieBreakVRF": "905ff37db665e86d60d93dabc92ec86124b1a22165e4968facca226e790d9b7dd3f60502fa84dcd30d95ac28a9278e5e72b8b2eb451759ab59c92fc65909a2b4"
    }
  },
  "sev": "Notice",
  "thread": "76",
  "host": "ramify"
}

Here, we can see that the chainLength of newTipSelectView is larger than the one of the chainLength of the oldTipSelectView, so the chain switch happened due to rule 1.

If these would be equal, one would have to look at rule 2 next (and then rule 3) to see whether it is responsible.


For convenience/simplicity, we could further enrich the trace message to directly include eg a reason field that says something like "LongerChain" or "BetterTieBreaker".

cc @mgmeier as you might have already started with this.

amesgen avatar Jan 06 '25 10:01 amesgen

That sort of log information would absolutely suit my needs. I wasn't aware we could get the "tieBreakVRF" values logged ordinarily.

What do I need to configure in my node config.json file so that such information is logged in my systemd journal?

TerminadaPool avatar Jan 06 '25 23:01 TerminadaPool

In both the old and new tracing system, you should see this when you enable JSON output and use a severity of at least Notice for ChainDB.AddBlockEvent.AddedToCurrentChain/ChainDB.AddBlockEvent.SwitchedToAFork.

amesgen avatar Jan 07 '25 10:01 amesgen

Thanks @amesgen for summarizing the discussion we had (BTW: needed to edit your post to change my GH handle to @mgmeier - the correct one). I'll add some more

  1. As mentioned the above trace messages / tracers need to be enabled, and severity filtering should be configured to include messages of serverity Notice.
  2. ~In the new system, you'd additionally need to configure the trace messages to be more verbose, by setting their detail level to at least DDetailed.~
  3. We have it on our list to add a convenience field to the trace message, stating a reason for switching to a fork (should that happen). Would this make sense to provide at normal detail level, without knowing the underlying detailed criteria (like chain length, opcert no and VRF)?
  4. Point 3 will likely happen for the new tracing system only, as the old system will be maintenance-only starting with Node release 10.2 - and not receive additional features.

mgmeier avatar Jan 07 '25 12:01 mgmeier

BTW: needed to edit your post to change my GH handle to @mgmeier - the correct one

Thanks for catching that!

  1. In the new system, you'd additionally need to configure the trace messages to be more verbose, by setting their detail level to at least DDetailed.

In the code, it looks like {new,old}TipSelectView is traced even with the default level of detail: https://github.com/IntersectMBO/cardano-node/blob/1774c93f9174022f359fab3bfe81d9427e32634a/cardano-node/src/Cardano/Node/Tracing/Tracers/ChainDB.hs#L511-L518

amesgen avatar Jan 07 '25 12:01 amesgen

In the code, it looks like {new,old}TipSelectView is traced even with the default level of detail

Aplogies, yes, the info will be traced at default detail level. I was looking at a local branch where I was toying around with the settings :-|

mgmeier avatar Jan 07 '25 12:01 mgmeier

In both the old and new tracing system, you should see this when you enable JSON output

@amesgen @mgmeier our pool has 4+ years of homebrewed automation driven from text-based output ("ViewMode": "SimpleView").

Does it follow from https://github.com/IntersectMBO/cardano-node/issues/6014#issuecomment-2574907814 that AddedToCurrentChain and SwitchedToAFork are not visible from SimpleView, and that therefore we would have to change to JSON logging to see these events?

rphair avatar Jan 07 '25 12:01 rphair

Which tracing system are you using (i.e., what is your config value for UseTraceDispatcher)?

mgmeier avatar Jan 07 '25 12:01 mgmeier

Whatever the default would be for the node: we don't have this setting in any of our config files.

rphair avatar Jan 07 '25 12:01 rphair

Then it is the old system, as currently it defaults to false. As far as I can see, the info for {new,old}TipSelectView is indeed only available for JSON logging.

However, here's an idea about how to scrape that info without breaking existing automation. In the node config, you could set up an additional scribe to a file, having "scFormat": "ScJson". You would then need to create a script, possibly based on grep and jq, that monitors the trace messages (=JSON objects) going to that file. When you encounter a message indicating switching to a fork, the script could inspect its data and send an event to the systemd journal, to merge with the rest of the logging data in text-based format. If you configure log rotation or somesuch for that file scribe, the script would need to take that into account somehow as well.

Would that work for you somehow?

mgmeier avatar Jan 07 '25 13:01 mgmeier

yes, thanks @mgmeier - to log these events we'll fork off another log as you suggest. 🙏 If whatever changes are contemplated here produce effects in SimpleView I trust we will be able to find out about them here or in a linked thread.

rphair avatar Jan 07 '25 13:01 rphair

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Feb 07 '25 01:02 github-actions[bot]

@TerminadaPool - as the OP, how do you feel about requiring a 2nd JSON format log output to make these events visible? And is this issue resolved as you see it? (the marking Stale as a foot-in-the-door to closing this FR doesn't seem like a proper resolution to me: seems like it should be either closed "Won't Fix" or given a label that keeps it open until the feature is provided).

rphair avatar Feb 07 '25 17:02 rphair

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Mar 11 '25 02:03 github-actions[bot]

@github-actions I am running out of things to say here - cc @mgmeier @amesgen @TerminadaPool

rphair avatar Mar 11 '25 06:03 rphair

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Apr 12 '25 02:04 github-actions[bot]

Hello @github-actions!

rphair avatar Apr 12 '25 17:04 rphair

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar May 15 '25 02:05 github-actions[bot]

@github-actions @mgmeier I do think this issue is still relevant and still hope that it can be catalogued in such a way that we don't have to confirm this every 30 days to avoid the issue becoming "stale" and disappearing.

rphair avatar May 15 '25 04:05 rphair

I don't actually separately configure an extra .json file output for logging but instead use the normal journal output. I would appreciate a normal log message stating the reason for why the node switched tip.

TerminadaPool avatar May 16 '25 08:05 TerminadaPool

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Jun 16 '25 02:06 github-actions[bot]

🎶

rphair avatar Jun 16 '25 02:06 rphair

In node version 10.4.1 I have configured MachineFormat for Stdout backend in TraceOptions:

"TraceOptions": {
    "": {
      "backends": [
        "Stdout MachineFormat",

MachineFormat provides information like this for each block adopted:

Jun 18 18:50:40 relay2 cardano-node[2570632]: {"at":"2025-06-18T08:50:40.790719758Z","ns":"ChainDB.AddBlockEvent.AddedToCurrentChain","data":{"kind":"AddedToCurrentChain","newTipSelectView":"chainLength":12012087,"issueNo":14,"issuerHash":"8d2fef46df5f9b2fcb64ebad739f27062b4f38c04df1a916fa216752","kind":"PraosChainSelectView","slotNo":158670349,"tieBreakVRF":"42e07a46dc4aeccb949caaef8c3341392ae0744a361fd854123d314f37723702950c80615201b90b18276239375a1f02ebe189e1243260347d9cc039941804bb"},"newtip":"6b01316c4536981cb767ff63521aa3d857a83f271e040a89d85471ea5fbc097f@158670349","oldTipSelectView":"chainLength":12012086,"issueNo":26,"issuerHash":"e985b5698de1386439e62b5b7ab2cc9f814b91b4a1fef5970f38694d","kind":"PraosChainSelectView","slotNo":158670327,"tieBreakVRF":"f5d16073b6bad50b1517830e70ef7a252ceada181cd7b4e7f6f35908487e7be14f7859a7f7c2d8997c29e9b23a9a844dd69dd3c03dbc72491877ea550cbd0247"}},"sev":"Notice","thread":"79","host":"relay2"}

It is a bit of a pain to read this json format in the journal, but you can see the tieBreakVRF value. When there is a "switched tip to new fork" event you can see that the node followed the fork with the lower tieBreakVRF.

Unfortunately you don't get the tieBreakVRF values in the logs if you use the human readable formats: HumanFormatColoured or HumanFormatUncoloured. But at least the spelling of "coloured" is correct. :)

TerminadaPool avatar Jun 18 '25 09:06 TerminadaPool

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.

github-actions[bot] avatar Jul 24 '25 02:07 github-actions[bot]