
[Metrics-generator] Add option to include dimensions from span events

faustodavid opened this issue Jun 23 '22 · 6 comments

Is your feature request related to a problem? Please describe.
Some instrumentations attach span events (shown as logs in the Tempo UI) with useful information. If we could generate metrics from them, we could build aggregations and see their rates and latencies.

For example, @opentelemetry/instrumentation-document-load adds the following events to the span:

[screenshot: span events emitted by @opentelemetry/instrumentation-document-load, e.g. fetchStart, firstPaint, firstContentfulPaint]

With this information, we could build a dashboard showing those latencies and quickly identify issues with those events, such as a slow firstPaint or firstContentfulPaint.

Describe the solution you'd like
An option in the metrics-generator to read those events and use their latency instead of the span duration.

This would probably be a new metric, alongside spanmetrics and service_graph. If it is a new metric, we should also be able to include dimensions from the span itself.

faustodavid avatar Jun 23 '22 14:06 faustodavid

Suppose I get a green light to implement this feature. I wonder what would be the best approach.

We could create a new metrics processor that allows selecting dimensions from both span attributes and event attributes.

metrics_generator:
  processor:
    span_metrics_events:
      dimensions:
        from_events:
          - message
        from_attributes:
          - service.version

  • An event is recorded if at least one of its attributes matches the provided from_events list.
  • Event attribute names are suffixed with _event.
  • The recorded duration is the event's latency rather than the span duration.
  • It uses the same default dimensions as span_metrics.
  • The metric name is traces_spanmetrics_events_latency.

Example:

traces_spanmetrics_events_latency_bucket{span_name="documentLoad", span_kind="SPAN_KIND_INTERNAL", span_status="STATUS_CODE_UNSET", service_version="3.1.0", message_event="firstPaint", le="0.512"}  1 # {TraceID="..."}

Another option might be to include these changes in the span_metrics processor and suffix span_name with _event whenever we record an event's latency instead of the span duration.

Example:

traces_spanmetrics_latency_bucket{span_name="documentLoad_event", span_kind="SPAN_KIND_INTERNAL", span_status="STATUS_CODE_UNSET", service_version="3.1.0", message_event="firstPaint", le="0.512"}  1 # {TraceID="..."}

Or maybe another idea 🙂 ?

faustodavid avatar Jun 27 '22 17:06 faustodavid

This is a really interesting proposal! I can see how it would be valuable to extract additional data out of the span events. I'm concerned about the active series this could generate, but that is something the user will have to figure out and configure appropriately anyway.

If I understand your proposal correctly, it should be possible to add this to the existing span metrics processor. We could add a configuration option to record certain events.

metrics_generator:
  processor:
    span_metrics:
      dimension:
        - service.version
      event:
        - message

If you don't set this, we simply don't look at the events. We already inspect every span one-by-one here, so it shouldn't be too hard to also loop through the events and emit metrics for those.

I'd be in favour of extending the existing span metrics by adding a label. Though I'm not sure that will be possible, since you have to be able to distinguish between the latency of the overall span and that of the events.

A couple of questions to check my understanding:

  • Say you include the event attribute message, and this event has multiple values (fetchStart, unloadEventStart, ...). Will every value be a separate series? That is, you'd have one series with message="fetchStart", one with message="unloadEventStart", etc.
  • What happens if the same event is present twice in the same span? Would you count it twice as well?
  • Will the latency of each series be the timing of the event? Reading the spec, it's possible to set custom timestamps, resulting in negative latencies. I don't think Prometheus histograms support negative values (or at least they behave oddly).

kvrhdn avatar Jun 27 '22 17:06 kvrhdn

I like how the configuration you're proposing looks; it's clean and introduces no breaking changes.

I'm concerned about the active series this could generate, but this is something the user will have to figure out and configure appropriately any way.

Yes, I can see how in some cases this can become an issue. I think once https://github.com/grafana/tempo/issues/1482 is implemented we could use similar logic to filter out events.

Though, I'm not sure that will be possible since you have to be able to distinguish between the latency of the overall span and the events.

Can we maybe distinguish the latency of the span and the event by adding a suffix to the span name when we are using the latency from the events?

Ex. span_name="documentLoad_event"

Say you include the event attribute message, and this event has multiple values (fetchStart, unloadEventStart, ...). Will every value be a separate series? That is, you'd have one series with message="fetchStart", one with message="unloadEventStart", etc.

Yes, I was thinking each event will generate a new series if it has at least one attribute from the provided list.

What happens if the same event is present twice in the same span? Would you count it twice as well?

Yes, if it occurs twice, we will record it twice. We could maybe add a position index, but I'm not sure that would add value.

Will the latency of each series be the timing of the event? Reading the spec, it's possible to set custom timestamps, resulting in negative latencies. I don't think Prometheus histograms support negative values (or at least they behave oddly).

Good catch! Indeed, the spec allows negative values. Could we specify that we only support values greater than or equal to 0, and otherwise ignore the event (or clamp it to zero)?

faustodavid avatar Jun 28 '22 13:06 faustodavid

I like the proposal and would like to see this feature move forward, but I have some thoughts regarding the generated metrics:

I'm not convinced that events are a great candidate to derive histograms from. They're events; they don't necessarily measure anything. It could be misleading to assume that something took event ts - span start ts. For example, a TLS renegotiation event might not be tied to the span's own work at all.

And even when they do measure a process, they aren't always tied to the span itself. From the example listed, unloadEventStart and unloadEventEnd are tightly coupled to each other, more than to the span itself; they look like spans disguised as events. I worry that these histograms will mainly consume active series rather than be useful.

Is there a way we could build or query histograms to get useful information from events? It's possible I'm missing use cases here.

However, I think counters could be very useful. Knowing how many times a certain event happened for a particular span could be of a lot of use! Extending the current metrics with an event or message label fits this case perfectly, better than recording the event via span_name.
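For example, such a counter series might look like the following (a hypothetical name mirroring the existing traces_spanmetrics_calls_total, with the proposed _event-suffixed label):

traces_spanmetrics_calls_total{span_name="documentLoad", span_kind="SPAN_KIND_INTERNAL", service_version="3.1.0", message_event="firstPaint"}  3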

mapno avatar Jul 04 '22 13:07 mapno

Good points. What I was trying to do is get the latencies of the events from the first picture.

But, indeed, not all span events might make sense to measure.

Let's say we implement https://github.com/grafana/tempo/issues/1482, perhaps using something like https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/internal/coreinternal/processor/filterspan/filterspan.go. We could then also use it to filter events and only select the ones that users actually want to measure. What do you think?

metrics_generator:
  processor:
    span_metrics:
      dimension:
        - service.version
      event:
        - message
      include:
        match_type: strict
        attributes:
          - Key: message
            Value: firstPaint

faustodavid avatar Jul 04 '22 14:07 faustodavid

Sure. Some filtering could help with cardinality in general 👍 I can see both allowlists and denylists being useful here. We should probably also think about how #1482 could apply to events.

mapno avatar Jul 06 '22 13:07 mapno

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Nov 14 '22 00:11 github-actions[bot]