Add "disregarded" errors to zetaclient's metrics system
Motivation and Context
Sometimes, zetaclient detects an error, logs it, and purposefully continues execution instead of crashing. These errors can be detected in the logs, but are not so easy to track.
Describe the solution you'd like
We should add some of these "disregarded" errors to the metrics system to make them easy to track.
List
Here is an (incomplete) list of places where we disregard these kinds of errors. Please add to this list if you find more errors like this.
- In ProcessOutboundTrackers when calling ZetaRepo().GetCCTX(...):
  - For Bitcoin, Solana, Sui, and TON.
- In ProcessInboundTrackers when processing an inbound tracker:
  - For Sui and TON.
  - Bitcoin, EVM, and Solana return the error.
All errors are eventually logged. The logging happens either in the function itself, as in the Sui and TON examples, or at the task level:
Block ticker
    if err := t.exec(ctx); err != nil {
        t.logger.Error().Err(err).Msg("Task error")
    }
Interval Ticker
    wrapper := func(ctx context.Context, t *ticker.Ticker) error {
        if err := task(ctx); err != nil {
            logger.Error().Err(err).Msgf("Task %s failed", taskName)
        }
        if intervalUpdater != nil {
            // noop if interval is not changed
            t.SetInterval(normalizeInterval(intervalUpdater()))
        }
        return nil
    }
In my opinion, logging and continuing is fine, but I agree we should follow a common standard; returning errors and logging them at the task level seems like the correct approach to me.
Regarding these two examples, though, we should have certain metrics in place to raise alerts. I think we could use a tracker count:
- Outbound: zetacore trackers
- Inbound: zetacore + internal trackers
Which I think we already have (not 100% sure).
Clarifying the idea of the issue:
Yes, all errors I've seen are being logged. As you said, there is no consistency as to where they are logged: sometimes at the task level, sometimes inside functions that do not return errors. (This is something I think we could eventually standardize.)
Still, some functions, like the one that processes outbound trackers, should not return an error to the task level.
ProcessOutboundTrackers loops through each outbound tracker, and if one of them fails for whatever reason, we just log the error and continue the loop.
These are what I'm calling "disregarded" errors. They get logged, but I think we should track them better. For example, using the metrics trackers you mentioned.
"Which I think we already have (not 100% sure)."
We may have them for ZetaCore, but not for ZetaClient?
Update based on discussion with @renan061
The core idea is to log/store important debug data in some storage (most likely as key-value pairs) so that it is easier to find than by going through Datadog logs.
For example:
- List of active internal trackers (inbound hashes)
- List of inbound hashes that failed at any step between block scan and vote broadcast (such as the validation step)
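One possible shape for such a store, as a rough sketch only: an in-memory map holding the two lists above. The InboundDebugStore type, its method names, and the step labels are hypothetical; the discussion does not settle on a storage backend or schema.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// InboundDebugStore is a hypothetical in-memory key-value store for debug data.
type InboundDebugStore struct {
	mu     sync.Mutex
	active map[string]struct{} // inbound hashes with an active internal tracker
	failed map[string]string   // inbound hash -> step at which it failed
}

// NewInboundDebugStore returns an empty store.
func NewInboundDebugStore() *InboundDebugStore {
	return &InboundDebugStore{
		active: make(map[string]struct{}),
		failed: make(map[string]string),
	}
}

// Track marks an inbound hash as having an active internal tracker.
func (s *InboundDebugStore) Track(hash string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.active[hash] = struct{}{}
}

// MarkFailed records the step (e.g. "validation") at which an inbound failed.
func (s *InboundDebugStore) MarkFailed(hash, step string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.failed[hash] = step
}

// Done removes all debug data for an inbound that was fully processed.
func (s *InboundDebugStore) Done(hash string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.active, hash)
	delete(s.failed, hash)
}

// ActiveHashes returns the tracked inbound hashes in sorted order.
func (s *InboundDebugStore) ActiveHashes() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	hashes := make([]string, 0, len(s.active))
	for h := range s.active {
		hashes = append(hashes, h)
	}
	sort.Strings(hashes)
	return hashes
}

// FailedStep returns the recorded failure step for hash, if any.
func (s *InboundDebugStore) FailedStep(hash string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	step, ok := s.failed[hash]
	return step, ok
}

func main() {
	store := NewInboundDebugStore()
	store.Track("0xaaa")
	store.Track("0xbbb")
	store.MarkFailed("0xbbb", "validation")
	store.Done("0xaaa")

	fmt.Println(store.ActiveHashes())
	fmt.Println(store.FailedStep("0xbbb"))
}
```

A lookup by inbound hash then answers "is this inbound still tracked, and where did it fail?" directly, instead of requiring a log search.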