envoy icon indicating copy to clipboard operation
envoy copied to clipboard

Event Tracking

Open snowp opened this issue 2 years ago • 9 comments

When debugging complex issues, the existing logging and stats observability is often inadequate as it will often fail to incorporate enough information to understand what is going on.

To aid these debugging scenarios, an event tracking library would be very useful.

As a first step, something that looks very similar to logs could provide some immediate value:

EMIT_EVENT(dns_resolution_complete, “dns_name”, name_);

if (EVENT_ENABLED(dns_resolution_failure)) {
  // ... do expensive calculation
  EMIT_EVENT(dns_resolution_failure, info); 
}

This would emit events very similar to how we emit logs today, but provide more stable identifiers instead of free form strings that might change as code is refactored. The idea of using an event name also allows for toggling of individual events, which provides more coarse grained flexibility compared to logging.

The events would be propagated to a sink similar to how logs are, using an interface that may look something like

void onEvent(absl::string_view name, std::initializer_list<std::string> info);

Future iterations of this could include adding support for scopes, potentially via ScopeTrackedObject or via other event specific scopes. These scopes could then also provide timings, powering a lot of perf related debugging (e.g. https://github.com/envoyproxy/envoy/issues/16988)

A strawman API with support for scopes to get things going:

struct EventContext {
  const ScopeTrackedContext* context_;
  std::vector<EventScope> event_scopes_;
};

class EventSink {
  virtual void onEvent(absl::string_view name, const std::vector<std::string>& info, const EventContext& context) PURE;
  virtual void onScopeStarted(absl::string_view name, const std::vector<std::string>& info, const EventContext& context, uint64_t scope_id) PURE;
  virtual void onScopeEnded(absl::string_view name, const std::vector<std::string>& info, const EventContext& context, uint64_t scope_iD) PURE;
};

auto scope_handle_ptr = START_EVENT_SCOPE("dns_resolution", "host", address);
EMIT_EVENT("dns_failure", "reason", reason";

@rojkov @jmarantz @mattklein123

Happy to bring this to a Google doc, but figured I'd gather some immediate feedback before writing up a more complete proposal.

snowp avatar Aug 17 '21 15:08 snowp

Have you looked at Perfetto? I think your description feels very similar to it. E.g. events from the same source could be annotated with an additional param to make them distinguishable.

I've submitted https://github.com/envoyproxy/envoy/pull/17591 for this.

rojkov avatar Aug 18 '21 07:08 rojkov

I hadn't looked into it, I had assumed it could be used for the perf part but I imagined the generic event emission might not be covered. Looking into it now though I'm wondering if we could implement these as immediately closing scopes or just as additional annotations on the perfetto scopes. Let me play around with it a bit for our use case and I'll see if it's a good fit.

snowp avatar Aug 18 '21 13:08 snowp

Looking into it I think this would be a great option, though the fact that it is not supported on iOS is a bit of a limitation for usages in envoy-mobile (where both me and @junr03 are coming from when asking for this kind of support).

https://groups.google.com/a/chromium.org/g/tracing/c/DDQGRl3ICW0 seems to imply that it's not that much work required to get it working on iOS, it just requires some prioritization. There seems to be a Google internal issue tracking this, not sure what the state of that is. @alyssawilk maybe you can check in on it?

snowp avatar Aug 18 '21 14:08 snowp

You are right.

I thought iOS was supported, looked at the code and found this

#if PERFETTO_BUILDFLAG(PERFETTO_OS_IOS)
    PERFETTO_FATAL("Data source TLS not supported on iOS, see b/158814068");
#endif

rojkov avatar Aug 18 '21 15:08 rojkov

Given lack of iOS support I think we may want to opt for something more lightweight to support just the event tracking use case: no timings, just a simple way to notify a sink that a certain event has happened.

https://github.com/envoyproxy/envoy/pull/17772 demonstrates a simple approach of reusing the log sink to pass some structured data to log sink, bypassing spdlog to avoid round tripping through a formatted string. Would love some feedback on this to see if this direction seems viable.

My thinking is that we can start with something simple and then extend the functionality as necessary, adding support for more robust perf tooling (via #16988) independently of this.

snowp avatar Aug 19 '21 14:08 snowp

I don't recall if we had an open issue (@DavidSchinazi?) and I think it's pretty far back on our queue but if this is for debugging event flow it's worth getting the take of @RyanTheOptimist :-)

alyssawilk avatar Aug 19 '21 18:08 alyssawilk

Oh this is great! I'm excited to see this move forward. This Event Tracking functionality is very similar in spirit to what Chrome has in the NetLog. Having something similar here would be awesome. In particular, Chrome's NetLog has a bit of structure to it that would be handy. More information on that design can be found here:

https://sites.google.com/a/chromium.org/dev/developers/design-documents/network-stack/netlog

In particular, events can be linked to one another via their ID which allows people looking at logs to, say, start at a URL_REQUEST and eventually drill down to the associate DNS_TRANSACTION, for example. Having some sort of linking like this would be great and would allow us to build some UI (HTML or command line) for querying these traces.

RyanTheOptimist avatar Aug 19 '21 22:08 RyanTheOptimist

@thhous-msft / @anrossi might be able to chime in since they worked on cross platform event tracing for https://github.com/microsoft/msquic

it would be awesome if we could get the event tracing to work with windows ETW as well

davinci26 avatar Aug 20 '21 21:08 davinci26

Are we able to make any progress?

simha5009 avatar Sep 19 '22 15:09 simha5009