
Consolidate performance related logging

Open bowenwang1996 opened this issue 3 years ago • 9 comments

Right now we use several different ways to log performance-related issues. We have delay_detector, performance_stats, and, inside the contract runtime, some new logging using tracing introduced by @matklad. We should consolidate those into a unified framework so that people don't get confused by all these different tools.

bowenwang1996 avatar Mar 05 '21 16:03 bowenwang1996

Note that we also already use both log and tracing for logging. The subtasks here are:

  • [x] replace log with tracing (mostly s/log/tracing, as they are API-compatible)
  • [ ] replace the delay detector with tracing. It's possible to ask tracing to log all spans longer than a specified duration (a possible approach is sketched after this list)
  • [x] replace stats and metrics with logging. I am not entirely sure that's possible; we might want something more custom for this use case
  • [x] make sure that we have tracing::span! calls where appropriate. Hierarchical logging is massively helpful when debugging issues and perf regressions.
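
A rough sketch of how the delay detector could be replaced: a tracing-subscriber Layer that records when each span is opened and warns if it outlives a threshold. The SlowSpanLayer name, its fields, and the warning wording are made up here for illustration, not an existing API:

use std::time::{Duration, Instant};
use tracing::span::{Attributes, Id};
use tracing_subscriber::layer::{Context, Layer};
use tracing_subscriber::registry::LookupSpan;

struct SlowSpanLayer {
    threshold: Duration,
}

struct StartTime(Instant);

impl<S> Layer<S> for SlowSpanLayer
where
    S: tracing::Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_new_span(&self, _attrs: &Attributes<'_>, id: &Id, ctx: Context<'_, S>) {
        // Stash the creation time in the span's extensions.
        if let Some(span) = ctx.span(id) {
            span.extensions_mut().insert(StartTime(Instant::now()));
        }
    }

    fn on_close(&self, id: Id, ctx: Context<'_, S>) {
        if let Some(span) = ctx.span(&id) {
            let elapsed = span
                .extensions()
                .get::<StartTime>()
                .map(|StartTime(start)| start.elapsed());
            if let Some(elapsed) = elapsed {
                if elapsed > self.threshold {
                    // This plays the role that delay_detector plays today.
                    tracing::warn!(span = span.name(), ?elapsed, "span exceeded threshold");
                }
            }
        }
    }
}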

matklad avatar Mar 05 '21 17:03 matklad

Unassigning myself -- clearly, I am not going to get to this anytime soon. Additional check-box:

  • [x] document nearcore logging best practices here

matklad avatar Jul 08 '21 13:07 matklad

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 06 '21 14:10 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 06 '22 00:01 stale[bot]

A more pressing issue: it should be possible to enable debug logging without affecting the performance of a node https://github.com/near/nearcore/issues/6072

bowenwang1996 avatar Jan 24 '22 15:01 bowenwang1996

replace stats and metrics with logging. I am not entirely sure that's possible; we might want something more custom for this use case

? That sounds exactly backwards to me. If something can be a metric, it's better as a metric, which can be collected and aggregated for monitoring.

mina86 avatar Jan 25 '22 18:01 mina86

Yeah, that was poor wording. What I meant to say is that tracing is basically a generic front-end for emitting events, and the sink can be whatever.

Basically, I want us to write something like the following (pseudocode sketched against the runtime):

fn process_receipt(receipt: Receipt) {
    // Span covering the whole receipt; the hash is attached to all nested events.
    let _span = tracing::info_span!("process_receipt", hash = %receipt.hash).entered();
    if receipt.is_action_receipt() {
        tracing::info!(n_actions = receipt.actions.len(), "action_receipt");
        for action in receipt.actions {
            // Nested span per action; a field like `deadline_ms` could later drive
            // a "too slow" warning in a custom subscriber.
            let _span = tracing::info_span!("process_action", deadline_ms = 100).entered();
            // ... process `action` ...
        }
    }
}

and get, by using appropriate subscribers, all of the following (a possible wiring is sketched after this list):

  • stderr logs for important events
  • a stream of events sent to some kind of structured event aggregator which shows graphs for receipt processing times and allows us to quickly answer questions like "what are the hashes of the receipts that took longest to process in the last 24 hours"
  • stderr warnings for spans which took longer than we'd expect them to.
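
For illustration, wiring those up might look roughly like this (assuming tracing-subscriber with its env-filter feature; SlowSpanLayer is the hypothetical slow-span layer sketched earlier in the thread, and an export layer such as one from tracing-opentelemetry could be slotted in the same way):

use tracing_subscriber::{fmt, prelude::*, EnvFilter};

fn init_tracing() {
    tracing_subscriber::registry()
        // Human-readable stderr logs for important events, filtered via RUST_LOG.
        .with(
            fmt::layer()
                .with_writer(std::io::stderr)
                .with_filter(EnvFilter::from_default_env()),
        )
        // An export layer (e.g. from tracing-opentelemetry) would go here to ship
        // the full stream of events/spans to an external aggregator.
        // Stderr warnings for spans that took longer than expected.
        .with(SlowSpanLayer { threshold: std::time::Duration::from_millis(100) })
        .init();
}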

The benefits I see here:

  • unified interface for emitting all kinds of events
  • hierarchical info attached to all events: tracing keeps track of which spans were active when a given event occurred, and that should allow us to trace outliers we see in metrics back to the specific inputs that caused them

But, again, for metrics in particular, I am not entirely sure this will work (my main concern being performance -- metrics in general support pre-aggregation in process, while tracing is built around shipping the full stream of events and leaving aggregation to the consumer. But then again, a subscriber can pre-aggregate in process and ship compressed data externally).
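
As a sketch of that "pre-aggregate in process" idea: a Layer that folds the event stream into a Prometheus counter (labelled by event target) instead of shipping every event downstream. The EventCounterLayer name and the labelling scheme are invented for illustration and assume the prometheus crate:

use prometheus::IntCounterVec;
use tracing::Event;
use tracing_subscriber::layer::{Context, Layer};

struct EventCounterLayer {
    // Constructed e.g. with IntCounterVec::new(
    //     prometheus::Opts::new("tracing_events_total", "events by target"), &["target"])
    // and registered with the node's Prometheus registry.
    counter: IntCounterVec,
}

impl<S: tracing::Subscriber> Layer<S> for EventCounterLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        // Aggregation happens here, in process; only counter values are exported.
        self.counter
            .with_label_values(&[event.metadata().target()])
            .inc();
    }
}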

matklad avatar Jan 25 '22 19:01 matklad

I have used tracing + an event/span ingest, as well as manually instrumented prometheus metrics, in the past. Having events/spans can produce very useful results. It becomes possible to figure out issues at the granularity of a single unit of work, see at exactly what point the issues manifested themselves, etc. They also:

  • require planning fleet-wide guidelines for event structure[^1];
  • require enforcing those guidelines somehow; and
  • are prone to producing amounts of data that quickly spiral out of control (that's a nice 10G uplink you have there; it would be a shame not to stuff it full of events and spans).

On one hand, the TBs of daily ingest never really presented much of a roadblock for the application's ability to perform; tracing and opentelemetry are just that good. On the other hand, this volume of events does imply pouring money into a provider of a distributed ingest service by the bathtubful, or making it somebody's full-time job to maintain the ingest infrastructure.

If we're looking to have a fully featured event/span ingest, we should have a plan for how we'll manage the infrastructure (and so somebody dealing with infrastructure should be part of these discussions). We'd also need to document this for other people running their validators. Otherwise, all the work going into instrumenting nearcore with these events and spans will only serve to produce nicer logs.

In contrast, simple prometheus metrics never really presented any problems related to the amount of data, even when not doing much to keep cardinality under control. But when I did hit a problem, it was more likely that I wouldn't have enough data around the problem area to point at a cause any more precisely than a broad guess.

[^1]: so that the different events/spans can be easily associated with each other.


As a side note, https://github.com/near/nearcore/issues/6072 was closed, but I believe the issue is still relevant, so another checklist item:

  • [ ] Improve our subscriber setup to reduce the performance impact of logging and otherwise sending events/spans (one possible direction is sketched below).
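
One possible direction, sketched under the assumption that we pull in the tracing-appender crate: move writing off the hot path onto a background worker and discard verbose events early with an EnvFilter, so that enabling more logging has less impact on the node:

use tracing_appender::non_blocking::WorkerGuard;
use tracing_subscriber::{fmt, prelude::*, EnvFilter};

fn init_low_overhead_logging() -> WorkerGuard {
    // Formatted output is written by a background worker; the returned guard must
    // be kept alive for the lifetime of the process or buffered logs are dropped.
    let (writer, guard) = tracing_appender::non_blocking(std::io::stderr());
    tracing_subscriber::registry()
        .with(
            fmt::layer()
                .with_writer(writer)
                // Cheaply discard debug/trace events unless explicitly enabled.
                .with_filter(EnvFilter::new("info")),
        )
        .init();
    guard
}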

nagisa avatar Mar 07 '22 18:03 nagisa

@nagisa I reopened #6072

bowenwang1996 avatar Mar 09 '22 01:03 bowenwang1996

This item is important. If we can get rid of the delay detector, that will be a win:

replace the delay detector with tracing. It's possible to ask tracing to log all spans longer than a specified duration

Created #8186

nikurt avatar Dec 08 '22 09:12 nikurt

All outstanding sub-issues were fixed.

nikurt avatar Sep 20 '23 13:09 nikurt