
Dispatch the metrics via telemetry

Open slashmili opened this issue 2 years ago • 20 comments

Hello!

I'm wondering if you are open to a contribution to introduce telemetry into this project?

There are lots of important metrics I'd like to tap into, especially around consumer connects and rebalances, and it would be a great addition to this library.

slashmili avatar May 11 '22 11:05 slashmili

Hi @slashmili

It'd have to be configurable though. WDYT? Inputs from @ieQu1 @qzhuyan @id @kjellwinblad @starbelly @vikas15bhardwaj are appreciated.

zmstone avatar May 11 '22 14:05 zmstone

@zmstone configurable to disable it completely? OK, fair enough.

Do you prefer it to be done via macros or via a runtime check? I'm asking because if macros are the way to go, I need to do some studying first.

slashmili avatar May 11 '22 14:05 slashmili

Hi @slashmili Ideally the dependency should be optional for users who do not need beam-telemetry. We can probably introduce a brod_telemetry module to confine the telemetry dependencies and use a macro to control if it's a no-op or call telemetry instead. e.g.

-ifndef(BROD_USE_TELEMETRY).
execute(_Name, _Measurements, _Metadata) -> ok.
-else.
execute(Name, Measurements, Metadata) -> telemetry:execute(Name, Measurements, Metadata).
-endif.
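For example, a wrapping project using rebar3 could switch the real calls on with a config fragment along these lines (the macro name follows the sketch above; the exact setup is an assumption, not an agreed design):

```erlang
%% rebar.config (sketch, assuming rebar3): define the macro so the
%% -else. branch above is compiled in, and pull in the telemetry dep.
{erl_opts, [{d, 'BROD_USE_TELEMETRY'}]}.
{deps, [telemetry]}.
```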

zmstone avatar May 11 '22 16:05 zmstone

Telemetry in brod would be a big win 👍 💯

starbelly avatar May 13 '22 13:05 starbelly

I'd love to help on this wherever I can. I think librdkafka's STATISTICS page could be helpful when trying to figure out exactly what metrics should be emitted.

spencerdcarlson avatar May 13 '22 15:05 spencerdcarlson

@zmstone regarding

Hi @slashmili Ideally the dependency should be optional for users who do not need beam-telemetry.

Does this actually need to be the case? By nature telemetry uses dynamic dispatch to delegate the execution of the handler (the "work") to the subscriber, so the framework has "optional" built in. If you subscribe to the events you opt in and some overhead is added; if you do not subscribe, there is no handler for the given event and no overhead is added.

It looks like the only overhead added to brod would be an ETS lookup to see if a handler for an event exists. That seems like a very small cost to pay compared to the added complexity of an optional dependency and a macro:

  • emit event - if there is no handler for an event, then lists:foreach(ApplyFun, Handlers) does not even run, since Handlers =:= []
  • handler lookup - returns []
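That fast path can be sketched like this (hypothetical event name; assumes the telemetry library is a dependency):

```erlang
%% With no handler attached, telemetry:execute/3 looks up the event in
%% telemetry's ETS table, gets [], and returns without applying anything.
telemetry:execute([brod, client, connect], #{count => 1}, #{}),

%% After attaching a handler, the same call dispatches to it.
ok = telemetry:attach(
       <<"log-brod-connect">>,
       [brod, client, connect],
       fun(_Event, Measurements, Metadata, _Config) ->
           logger:info("brod connect: ~p ~p", [Measurements, Metadata])
       end,
       undefined),
telemetry:execute([brod, client, connect], #{count => 1}, #{}).
```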

I don't see telemetry as an optional dependency for any large packages like ecto_sql which can have an extremely high throughput.

spencerdcarlson avatar May 13 '22 17:05 spencerdcarlson

I’m thinking about it more from a dependency-hygiene perspective; I’m not concerned about performance.

Given it’s essentially two apis to integrate, I assumed it’s not a big effort to implement.

Btw, the compression libs are also made optional (at runtime).

One org may use beam-telemetry, another may have a different one. Being able to “plug in” different instrumentation code at compile time makes it easy to extend in the future, and is less intrusive to the wrapping project’s dependency management.

zmstone avatar May 14 '22 06:05 zmstone

Hi @zmstone! I think the purpose of beam-telemetry is to be agnostic of the handling details. It is just a dispatch library that is maintained by the community and endorsed by the observability working group of the ErlEF: https://github.com/erlef/observability-wg .

I think its goal is exactly to be this standard gluing layer for telemetry. It seems aligned with your goal of dependency hygiene (if no handlers are attached, it is almost a pass-through call with no side effects), and it lets consumers choose their instrumentation library. From the ones I know, they all seem to integrate nicely with telemetry.

Compression is a different case as there is no standard endorsed by the ErlEF currently.

What do you think?

victorolinasc avatar May 24 '22 12:05 victorolinasc

Fair enough. Does it make sense to have the instrumentation code defined as BROD_-prefixed macros?

That way a wrapping project still has the chance to switch to other calls without going through beam-telemetry.

zmstone avatar May 24 '22 18:05 zmstone

TBH I don't mind wrapping it in a brod_telemetry module; either way works for me. telemetry has three functions we are interested in:

-module(brod_telemetry).

-export([execute/2,
         execute/3,
         span/3]).

execute(EventName, Measurements, Metadata) -> telemetry:execute(EventName, Measurements, Metadata).
execute(EventName, Measurements) -> telemetry:execute(EventName, Measurements).
span(EventPrefix, StartMetadata, SpanFunction) -> telemetry:span(EventPrefix, StartMetadata, SpanFunction).

What I've been struggling with:

  1. Where to add these telemetry events. At @highmobility we are interested in brod_group_coordinator activities, like rebalancing, the time it takes to join the group, and so on. I suppose others are interested in other parts. I was thinking of starting to introduce telemetry in the places I'm familiar with, and then the rest could be added as separate PRs by others (and me, of course).

Let me know if you think the first PR should cover telemetry in any other module.

  2. The way to use telemetry:

We can use telemetry in two ways: A. telemetry:execute, B. telemetry:span.

While execute is easy and requires fewer changes to the code, I found myself reaching for span most of the time. For example, this block of code rewritten to use span would go FROM:

do_connect(Endpoint, State) ->
  ConnConfig = conn_config(State),
  kpro:connect(Endpoint, ConnConfig).

TO:

do_connect(Endpoint, #state{client_id = ClientId} = State) ->
  StartMetadata = #{client_id => ClientId},
  brod_telemetry:span(
    [brod, client, connect],
    StartMetadata,
    fun() ->
      ConnConfig = conn_config(State),
      Result = kpro:connect(Endpoint, ConnConfig),
      Metadata = maps:merge(StartMetadata, connect_result_to_metadata(Result)),
      {Result, Metadata}
    end
  ).

connect_result_to_metadata({ok, _}) -> #{status => ok, reason => nil};
connect_result_to_metadata({error, Reason}) -> #{status => error, reason => Reason}.

It's easier to use span since it keeps track of time and emits start, stop, and (if one happens) exception events.

do_connect/2 is a really small function; imagine a more complex one. To keep the code simple I have to move a lot of code out of the main function into helpers, e.g. brod_group_coordinator:join_group/1 becomes:

join_group(#state{ groupId = GroupId
                 , client  = ClientId
                 } = State0) ->
  StartMetadata = #{client_id => ClientId, group_id => GroupId},
  brod_telemetry:span(
    [brod, group_coordinator, join_group],
    StartMetadata,
    fun() ->
        Result = do_join_group(State0),
        {Result, maps:merge(StartMetadata, join_group_to_metadata(Result))}
    end).

join_group_to_metadata({ok, #state{ memberId     = MemberId
                                  , leaderId     = LeaderId
                                  , generationId = GenerationId
                                  , members      = Members
                                  }}) ->
  #{ member_id     => MemberId
   , leader_id     => LeaderId
   , generation_id => GenerationId
   , members       => Members}.

do_join_group....

The code becomes so noisy that I know I'll hate my PR already 😭

What do you think? Are these kinds of changes inevitable, or do you have other suggestions?

slashmili avatar May 25 '22 19:05 slashmili

Just looking at ecto/db_connection suite as a reference — their approach seems to be to time the relevant parts and emit the timed event with execute. sql.ex#L1101-L1103

spencerdcarlson avatar May 27 '22 20:05 spencerdcarlson

Thanks for the pointer!

I think it was an easier choice for Ecto, since it already had the data for logging.

I also looked at broadway, which uses only span, and Finch, which uses execute to emit span-like events.

OK, so it's not uncommon to avoid restructuring the code and instead run execute with start and stop events manually!

I'll take that approach then
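For reference, emulating span with manual start/stop execute calls could look roughly like this; the event names and measurements follow telemetry's span conventions, but this is my own sketch, not brod code:

```erlang
%% Emit start/stop events around the connect call, measuring duration
%% ourselves in native time units instead of using telemetry:span/3.
do_connect(Endpoint, #state{client_id = ClientId} = State) ->
  Metadata = #{client_id => ClientId},
  StartTime = erlang:monotonic_time(),
  telemetry:execute([brod, client, connect, start],
                    #{system_time => erlang:system_time()},
                    Metadata),
  Result = kpro:connect(Endpoint, conn_config(State)),
  telemetry:execute([brod, client, connect, stop],
                    #{duration => erlang:monotonic_time() - StartTime},
                    Metadata#{result => Result}),
  Result.
```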

slashmili avatar May 28 '22 03:05 slashmili

This would be a very welcome feature and would also help integrate OpenTelemetry tracing, similar to how the oban integration does it.

btkostner avatar Jun 13 '22 19:06 btkostner

@slashmili let me know what you think of https://github.com/kafka4beam/brod/pull/512. Just starting to take a very light stab at adding telemetry.

spencerdcarlson avatar Jun 19 '22 04:06 spencerdcarlson

On the topic of telemetry (the general term, not the library) there is also a need for propagating distributed trace context in kafka messages and instrumentation with telemetry does not help with this.

So if you add some abstraction that calls telemetry depending on user configuration, it would be great if there were also a way for the user to set a callback or middleware that updates messages being produced, so they can add the context as Kafka headers.

If a non-telemetry based way to instrument was provided then we'd probably also create an OpenTelemetry instrumentation instead of doing brod->telemetry->opentelemetry. But having to use a different module for calls to make them instrumented is annoying since it means you can't instrument libraries you are using without forking and changing them.

The way OpenTelemetry tries to make integration zero-cost for a user who doesn't use it is the separation of the API and the SDK. Including only opentelemetry_api as a dependency, as brod or a new app opentelemetry_brod would do, means calls to start spans or record metrics are no-ops. If the user includes the SDK, opentelemetry, it hooks itself up so those calls actually do something.
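As an illustration of that API/SDK split, an instrumentation built on opentelemetry_api alone might look roughly like this (the ?with_span macro comes from otel_tracer.hrl; the function and span name are hypothetical):

```erlang
-include_lib("opentelemetry_api/include/otel_tracer.hrl").

connect_with_span(Endpoint, State) ->
  %% With only opentelemetry_api present this expands to a no-op span;
  %% when the opentelemetry SDK is running it records a real span.
  ?with_span(<<"brod.client.connect">>, #{},
             fun(_SpanCtx) ->
               kpro:connect(Endpoint, conn_config(State))
             end).
```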

And just in case, the Otel semantic conventions for Kafka may be of use https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/instrumentation/kafka.md

tsloughter avatar Jun 23 '22 20:06 tsloughter

Thank you all for the ideas. I have pushed a branch, dev/telemetry; let's start instrumenting the code in this branch, and when everybody is happy with the instrumentation we can merge it to master and bump a minor version.

@tsloughter to make the instrumentation nearly no-op for people who do not need it, my original naive idea was to use a compile flag to be switched on by the wrapping project.

I only had a quick look at opentelemetry_api. Do I understand correctly that the overhead for someone who does not need it is a few reads from persistent_term?

zmstone avatar Jul 08 '22 20:07 zmstone

Hi again @tsloughter. I am not quite sure what opentelemetry_brod would do, or why you brought it up. Is there any reason why someone should not include opentelemetry_api directly, but create an opentelemetry_appname instead? There is probably a link somewhere I can read up on? :D

zmstone avatar Jul 08 '22 20:07 zmstone

@zmstone A compile-time flag can be nice for some, but being able to switch between no-op and active without recompiling is useful/easier for others.

And yes, the opentelemetry_api macros look up the tracer to use via persistent terms and get the no-op tracer when no SDK is installed.

There isn't a reason not to include opentelemetry_api. It is just common to have an opentelemetry_appname for various reasons, like the author not wanting to tie themselves to a particular library, or to make it faster/easier to iterate on the instrumentation.

tsloughter avatar Jul 11 '22 18:07 tsloughter

Any updates on this? It would be very valuable.

Graborg avatar Mar 06 '24 09:03 Graborg

I recently picked kaffe, which builds on brod, and I was wondering if OTel support is still being considered or in progress?

dimitarvp avatar Mar 09 '24 04:03 dimitarvp