cosmos-sdk icon indicating copy to clipboard operation
cosmos-sdk copied to clipboard

Only do telemetry computation if telemetry is enabled in the node

Open ValarDragon opened this issue 4 years ago • 31 comments

Summary

As detailed in other issues, the mutex locks and time.Now() syscalls' taken in telemetry are rather expensive for nodes. In general, nodes not using telemetry should not pay these costs.

Problem Definition

Nodes that aren't using telemetry should not pay the costs of telemetry.

Proposal

Somehow make the telemetry package know the result of its config option detailing whether or not its enabled. Then only do operations when its enabled. This is useful due to potential for mutex lock contention.

This doesn't solve the extraneous time.Now() call, but that should hopefully be solvable via other mechanisms. (e.g. not putting telemetry into hot loops / low level items) This can't be deadcode eliminated / constant folded, since telemetry enablement is a run-time flag, not a compile time flag


For Admin Use

  • [ ] Not duplicate issue
  • [ ] Appropriate labels applied
  • [ ] Appropriate contributors tagged
  • [ ] Contributor assigned/self-assigned

ValarDragon avatar Sep 28 '21 05:09 ValarDragon

ACK, yeah this totally makes sense and TBH was an oversight on my part. I didn't realize how much time these calls took.

alexanderbez avatar Sep 28 '21 14:09 alexanderbez

To improve the time.Now() call, we can make that a telemetry function. So replace time.Now() in the defer statements with telemetry.StartTimer(). And the StartTimer function would be if telemetry enabled { return time.Now() } else { time.Time{} }

ValarDragon avatar Oct 08 '21 22:10 ValarDragon

Does anyone have any guidance on how we can get there to be a flag for whether or not telemetry is enabled? Some thoughts that come to mind for me:

  • Does such a flag belong in the ctx
  • Should it be a global variable in the telemetry package thats set on node startup?
  • Should there be a telemetry object accessible via ctx?

I feel like putting a TelemetryEnabled flag in the context may be the simplest thing thats re-usable, but not at all sure its the right thing to do.

ValarDragon avatar Oct 17 '21 05:10 ValarDragon

@ValarDragon the SDK context type? I think that might be our only option. Alternatively, we have a global variable in the telemetry package.

alexanderbez avatar Oct 18 '21 13:10 alexanderbez

yeah meant SDK context type.

I feel like a global variable in telemetry will make the syntax for doing telemetry calls simpler at least, no idea if it would cause problems for integrators though. cc @jackzampolin

e.g. API with global variable

defer telemetry.ModuleMeasureSince(types.ModuleName, telemetry.StartTimer(), telemetry.MetricKeyBeginBlocker)

API choices with it in context:

if ctx.telemetryEnabled {
defer telemetry.ModuleMeasureSince(types.ModuleName, telemetry.StartTimer(), telemetry.MetricKeyBeginBlocker)`
}

or

defer telemetry.ModuleMeasureSince(ctx, types.ModuleName, telemetry.StartTimer(ctx), telemetry.MetricKeyBeginBlocker)`

I don't see a reason why if your running N tendermint chains, you'd only want telemetry on a subset. Is that an important thing to support?

ValarDragon avatar Oct 19 '21 17:10 ValarDragon

Asked @jackzampolin over DM, he saw no client issue with adding a global variable to the telemetry package, thats set on telemetry initialization. So lets go with that approach! (It has the easier API)

ValarDragon avatar Oct 27 '21 04:10 ValarDragon

can we do what Tendermint does and pass a noop telemetry? or this is about also avoiding time calls?

tac0turtle avatar Nov 16 '21 10:11 tac0turtle

Also avoiding the time.Now() calls.

ValarDragon avatar Nov 16 '21 13:11 ValarDragon

can we do what Tendermint does and pass a noop telemetry? or this is about also avoiding time calls?

Yes

Also avoiding the time.Now() calls.

Yes

alexanderbez avatar Nov 16 '21 14:11 alexanderbez

@marbar3778 @alexanderbez, can you provide a list of actionable items here? Happy to work on this

@marbar3778: don't call time.Now() in the function but move time to a higher level and create a noop telemetry struct

What do you mean by higher level? server?

fedekunze avatar May 21 '22 07:05 fedekunze

I'm not really sure what @marbar3778 is referring to, but you can't really avoid defer timing calls. The idea is that we have a config in the config struct and delegate all telemetry calls to a function that checks that config. Something like:

// in the telemetry package
func Emit(ctx sdk.Context, metricCb func()) {
  if ctx.TelemetryEnabled() {
    metricCb()
  }
}

// some function in keeper
func (k Keeper) Foo(ctx sdk.Contex, ...) {
  defer telemetry.Emit(ctx, func() {
    telemetry.MeasureSince(time.Now(), "foo", "bar")
  })
}

alexanderbez avatar May 23 '22 13:05 alexanderbez

it means replacing the time.Now() call with telemetry.StartTimer(), which then does the config check, and if so runs time.Now(), otherwise doesn't run time.Now()

ValarDragon avatar May 30 '22 10:05 ValarDragon

I think my proposal achieves that whilst also being flexible to other things besides time measurements (e.g. counters)

alexanderbez avatar May 30 '22 13:05 alexanderbez

MeasureSInce and associated functions should call time.Now() instead of leaving it to the keeper. This way the noOP metric gatherer can be set without time.now or with.

tac0turtle avatar Dec 29 '22 10:12 tac0turtle

What's the status of this? It sounds like an API change, right? If so, can we agree on the proposed API?

elias-orijtech avatar May 12 '23 19:05 elias-orijtech

this is still relevant, i dont have a particular api in mind, do you have any thoughts?

tac0turtle avatar May 13 '23 08:05 tac0turtle

I'm not really sure what @marbar3778 is referring to, but you can't really avoid defer timing calls. The idea is that we have a config in the config struct and delegate all telemetry calls to a function that checks that config. Something like:

// in the telemetry package
func Emit(ctx sdk.Context, metricCb func()) {
  if ctx.TelemetryEnabled() {
    metricCb()
  }
}

// some function in keeper
func (k Keeper) Foo(ctx sdk.Contex, ...) {
  defer telemetry.Emit(ctx, func() {
    telemetry.MeasureSince(time.Now(), "foo", "bar")
  })
}

This design seems compelling, but has the issue that the lazy time.Now will be called when Foo returns, not when entered and so MeasureSince will not measure what you expect.

As detailed in other issues, the mutex locks and time.Now() syscalls' taken in telemetry are rather expensive for nodes. In general, nodes not using telemetry should not pay these costs.

What mutex locks are you referring to? go-metric accesses the global metric instance through atomic loads. A global, say, atomic bool to track the status of metrics would probably cost the same in CPU time as the atomic pointer load of the metric object.

Note that

if ctx.telemetryEnabled { defer telemetry.ModuleMeasureSince(types.ModuleName, telemetry.StartTimer(), telemetry.MetricKeyBeginBlocker)` }

is racy. It's otherwise a reasonable design if telemetryEnabled is made atomic, because we won't need telemetry.StartTimer.

elias-orijtech avatar Jun 06 '23 22:06 elias-orijtech

Good point. We could augment the metricCb to take ...any then.

alexanderbez avatar Jun 07 '23 14:06 alexanderbez

Why not check and load the metric at the same time? Say,

	defer telemetry.ModuleMeasureSince(types.ModuleName, time.Now(), telemetry.MetricKeyBeginBlocker)

becomes

    if m := telemetry.Load(); m != nil {
	defer telemetry.ModuleMeasureSinceLocal(m, types.ModuleName, time.Now(), telemetry.MetricKeyBeginBlocker)
    }

?

It's an extra if-check that can side-step all extra work including time.Now, and doesn't slow down in the telemetry enabled case because there is only the one load of the global metric.

elias-orijtech avatar Jun 08 '23 22:06 elias-orijtech

is there a way to make so this is handled automatically? the if checks will become redundant and annoying to write quickly

tac0turtle avatar Jun 09 '23 12:06 tac0turtle

@elias-orijtech I don't think it should be the caller's responsibility -- poor UX IMO

alexanderbez avatar Jun 09 '23 14:06 alexanderbez

What's the alternative? Can you expand on https://github.com/cosmos/cosmos-sdk/issues/10245#issuecomment-1580885936?

elias-orijtech avatar Jun 09 '23 14:06 elias-orijtech

Well with my proposal, the APIs wouldn't be that much cleaner anyway. I think we might have to have Emit* functions for each type of metrics call supported. Then that Emit* method checks if telemetry is enabled.

e.g.

// in the telemetry package
func EmitMeasureSince(ctx sdk.Context, t time.Time, args ...string) {
  if ctx.TelemetryEnabled() {
     telemetry.MeasureSince(t, args...)
  }
}

// some function in keeper
func (k Keeper) Foo(ctx sdk.Contex, ...) {
  t := time.Now()
  defer telemetry.EmitMeasureSince(ctx, t)
}

alexanderbez avatar Jun 10 '23 00:06 alexanderbez

Well with my proposal, the APIs wouldn't be that much cleaner anyway. I think we might have to have Emit* functions for each type of metrics call supported. Then that Emit* method checks if telemetry is enabled.

e.g.

// in the telemetry package
func EmitMeasureSince(ctx sdk.Context, t time.Time, args ...string) {
  if ctx.TelemetryEnabled() {
     telemetry.MeasureSince(t, args...)
  }
}

// some function in keeper
func (k Keeper) Foo(ctx sdk.Contex, ...) {
  t := time.Now()

This design doesn't omit the call to time.Now, as requested by the OP. telemetry.Now was suggested, but that's even more API. Also, the resulting design does 3 atomic loads: 1 atomic for each of telemetry.Now and ctx.TelemetryEnabled to check whether metrics is enabled, and one atomic inside telemetry.Measure... to access the global telemetry object. By comparison my suggestion is 1 atomic load.

I still think my suggestion is simpler overall, although I admit it is somewhat clumsier.

defer telemetry.EmitMeasureSince(ctx, t) }

elias-orijtech avatar Jun 11 '23 23:06 elias-orijtech

can time.Now() be put into EmitMeasureSince?

tac0turtle avatar Jun 12 '23 10:06 tac0turtle

can time.Now() be put into EmitMeasureSince?

EmitMeasureSince is deferred, so its time.Now would be too late.

elias-orijtech avatar Jun 12 '23 14:06 elias-orijtech

  1. You're putting responsibility on the caller via if m := telemetry.Load(); m != nil { ... }
  2. Isn't the time.Now() not evaluated until the deferred call? If so, that won't be accurate.

alexanderbez avatar Jun 12 '23 15:06 alexanderbez

  1. You're putting responsibility on the caller via if m := telemetry.Load(); m != nil { ... }

Yes, that's the clumsiness of my proposal. However, the idiom is almost akin to if err := ...; err != nil.

  1. Isn't the time.Now() not evaluated until the deferred call? If so, that won't be accurate.

No. The arguments to a deferred function call are evaluated eagerly.

elias-orijtech avatar Jun 12 '23 16:06 elias-orijtech

Yeah I just don't see how that's a cleaner API and dev UX personally.

alexanderbez avatar Jun 12 '23 17:06 alexanderbez

Alright. So what's the final API? https://github.com/cosmos/cosmos-sdk/issues/10245#issuecomment-1585302447 plus telemetry.Now?

elias-orijtech avatar Jun 12 '23 19:06 elias-orijtech