Add "disregarded" errors to zetaclient's metrics system

Open renan061 opened this issue 2 months ago • 3 comments

Motivation and Context

Sometimes, zetaclient detects an error, logs it, and purposefully continues execution instead of crashing. These errors can be detected in the logs, but are not so easy to track.

Describe the solution you'd like

We should add some of these "disregarded" errors to the metrics system to make them easy to track.

List

Here is an (incomplete) list of places where we disregard these kinds of errors. Please add to this list if you find more errors like this.

  • In ProcessOutboundTrackers when calling ZetaRepo().GetCCTX(...):
    • For Bitcoin, Solana, Sui, and TON.
  • In ProcessInboundTrackers when processing an inbound tracker:
    • For Sui and TON.
    • Bitcoin, EVM, and Solana return the error.

renan061 avatar Oct 08 '25 15:10 renan061

All errors are eventually logged. The logging happens either in the function itself, as in the Sui and TON examples, or at the task level:

Block ticker

if err := t.exec(ctx); err != nil {
    t.logger.Error().Err(err).Msg("Task error")
}

Interval Ticker

	wrapper := func(ctx context.Context, t *ticker.Ticker) error {
		if err := task(ctx); err != nil {
			logger.Error().Err(err).Msgf("Task %s failed", taskName)
		}

		if intervalUpdater != nil {
			// noop if interval is not changed
			t.SetInterval(normalizeInterval(intervalUpdater()))
		}

		return nil
	}

In my opinion, logging and continuing is fine, but I agree we should follow a common standard. Returning errors and logging them at the task level seems like the correct approach to me.

Regarding these two examples, though, we should have certain metrics in place to raise alerts. I think using a tracker count would work:

  • Outbound: zetacore trackers
  • Inbound: zetacore + internal trackers

Which I think we already have (not 100% sure).
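As a rough illustration of the tracker-count idea, the sketch below derives the two gauge values from lists of tracker identifiers. Everything here is hypothetical: `TrackerCounts`, `countTrackers`, and the use of plain strings as identifiers are not taken from zetaclient, and counting an inbound hash only once when it appears in both zetacore and the internal trackers is an assumption.

```go
package main

import "fmt"

// TrackerCounts holds the two gauge values suggested above.
// Hypothetical type, not part of zetaclient.
type TrackerCounts struct {
	Outbound int // zetacore outbound trackers
	Inbound  int // zetacore + internal inbound trackers
}

// countTrackers derives the gauge values from lists of tracker identifiers.
// Assumption: an inbound hash present in both sources is counted once.
func countTrackers(zetacoreOutbound, zetacoreInbound, internalInbound []string) TrackerCounts {
	seen := make(map[string]struct{}, len(zetacoreInbound)+len(internalInbound))
	for _, h := range zetacoreInbound {
		seen[h] = struct{}{}
	}
	for _, h := range internalInbound {
		seen[h] = struct{}{}
	}
	return TrackerCounts{Outbound: len(zetacoreOutbound), Inbound: len(seen)}
}

func main() {
	c := countTrackers(
		[]string{"0xa1", "0xa2"}, // zetacore outbound trackers
		[]string{"0xb1"},         // zetacore inbound trackers
		[]string{"0xb1", "0xb2"}, // internal inbound trackers
	)
	fmt.Printf("outbound=%d inbound=%d\n", c.Outbound, c.Inbound)
}
```

An alert could then fire when either value stays above some threshold for too long; the threshold itself would have to be chosen per chain.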

kingpinXD avatar Oct 09 '25 15:10 kingpinXD

Clarifying the idea of the issue:

Yes, all errors I've seen are being logged. As you said, there is no consistency as to where they are logged: sometimes at the task level, sometimes inside functions that do not return errors. (This is something I think we could eventually standardize.)

Still, some functions, like the one that processes outbound trackers, should not return an error to the task level. ProcessOutboundTrackers loops through each outbound tracker, and if one of them fails for whatever reason, we just log the error and continue the loop.

These are what I'm calling "disregarded" errors. They get logged, but I think we should track them better. For example, using the metrics trackers you mentioned.
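A minimal sketch of what tracking a "disregarded" error could look like in a log-and-continue loop. It uses only the standard library so that it runs on its own; `DisregardedErrors`, `processTrackers`, and the `process_outbound_trackers` label are hypothetical names, and in zetaclient the counter would presumably live in the existing metrics system rather than in a hand-rolled map.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"sync"
)

// errNotFound is a stand-in for whatever error a lookup might return.
var errNotFound = errors.New("cctx not found")

// DisregardedErrors counts errors that are logged and then skipped.
// Hypothetical type: a real implementation would likely be a labelled
// counter registered with the metrics system.
type DisregardedErrors struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewDisregardedErrors() *DisregardedErrors {
	return &DisregardedErrors{counts: make(map[string]int)}
}

// Inc records one disregarded error for the given operation and chain.
func (d *DisregardedErrors) Inc(operation, chain string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.counts[operation+"/"+chain]++
}

// Count returns how many errors were recorded for the operation and chain.
func (d *DisregardedErrors) Count(operation, chain string) int {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.counts[operation+"/"+chain]
}

// processTrackers mimics the log-and-continue loop: a failing tracker is
// logged and counted, and the loop moves on to the next one.
// It returns the number of trackers processed without error.
func processTrackers(chain string, nonces []uint64, process func(uint64) error, m *DisregardedErrors) int {
	processed := 0
	for _, nonce := range nonces {
		if err := process(nonce); err != nil {
			log.Printf("chain=%s nonce=%d: disregarded error: %v", chain, nonce, err)
			m.Inc("process_outbound_trackers", chain)
			continue
		}
		processed++
	}
	return processed
}

func main() {
	m := NewDisregardedErrors()
	failOdd := func(nonce uint64) error {
		if nonce%2 == 1 {
			return errNotFound
		}
		return nil
	}
	ok := processTrackers("bitcoin", []uint64{1, 2, 3, 4}, failOdd, m)
	fmt.Println("processed:", ok, "disregarded:", m.Count("process_outbound_trackers", "bitcoin"))
}
```

The loop still never returns an error to the task level; the only change is that each skipped error leaves a countable trace next to its log line.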

Which I think we already have (not 100% sure).

We may have them for ZetaCore, but not for ZetaClient?

renan061 avatar Oct 09 '25 16:10 renan061

Update based on discussion with @renan061

The core idea is to log/store important debug data in some storage (most likely as key-value pairs), so that it's easier to find than by going through Datadog logs.

For example

  • List of active internal trackers (inbound hashes)
  • List of inbound hashes that failed at any step between block scan and vote broadcast (such as at the validation step)
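The two lists above could be sketched as a small key-value store keyed by inbound hash. This is only an in-memory illustration under assumptions: `DebugStore`, `FailedInbound`, the method names, and the step names used in `main` are hypothetical, and the actual storage backend is not decided in this thread.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// FailedInbound records where and why an inbound failed.
// Hypothetical type, not part of zetaclient.
type FailedInbound struct {
	Step   string // e.g. "validation"; step names are assumptions
	Reason string
}

// DebugStore is an in-memory key-value store keyed by inbound hash.
type DebugStore struct {
	mu     sync.RWMutex
	active map[string]struct{}
	failed map[string]FailedInbound
}

func NewDebugStore() *DebugStore {
	return &DebugStore{
		active: make(map[string]struct{}),
		failed: make(map[string]FailedInbound),
	}
}

// TrackInbound adds a hash to the list of active internal trackers.
func (s *DebugStore) TrackInbound(hash string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.active[hash] = struct{}{}
}

// MarkFailed stores the step and reason for a failed inbound.
func (s *DebugStore) MarkFailed(hash, step, reason string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.failed[hash] = FailedInbound{Step: step, Reason: reason}
}

// Resolve removes a hash from both lists once it has been handled.
func (s *DebugStore) Resolve(hash string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.active, hash)
	delete(s.failed, hash)
}

// ActiveTrackers returns the active inbound hashes in sorted order.
func (s *DebugStore) ActiveTrackers() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	hashes := make([]string, 0, len(s.active))
	for h := range s.active {
		hashes = append(hashes, h)
	}
	sort.Strings(hashes)
	return hashes
}

// Failed looks up the failure record for a hash, if any.
func (s *DebugStore) Failed(hash string) (FailedInbound, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	f, ok := s.failed[hash]
	return f, ok
}

func main() {
	store := NewDebugStore()
	store.TrackInbound("0x01")
	store.TrackInbound("0x02")
	store.MarkFailed("0x02", "validation", "example failure reason")

	fmt.Println("active:", store.ActiveTrackers())
	if f, ok := store.Failed("0x02"); ok {
		fmt.Printf("0x02 failed at %s: %s\n", f.Step, f.Reason)
	}
}
```

Looking up a single hash then answers "where did this inbound get stuck?" without searching through logs.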

kingpinXD avatar Oct 15 '25 16:10 kingpinXD