opentelemetry-java-contrib icon indicating copy to clipboard operation
opentelemetry-java-contrib copied to clipboard

SpanMetricsProcessor for agent instrumentation

Open anuragagarwal561994 opened this issue 1 year ago • 9 comments

Is your feature request related to a problem? Please describe. Opentelmetry java instrumentation is nicely able to handle the application traces using annotations, a nice feature to have would be to record metrics of the spans using the annotations itself. The metrics would be of the way like span metrics processor does to record the R.E.D metrics.

Describe the solution you'd like We can use the current span annotations to make a histogram. Histogram itself will give the counter metrics. We can define two fields in the annotation, one to disable metrics for a particular method and another to add a gauge for the corresponding method. The whole feature can be disabled by default and can be enabled on demand by using a configuration option. Attributes of a span can be used as labels and exceptions can be used as the corresponding error metrics (can be thought more on the same). These metrics can be exposed using the prometheus metrics exporter.

Describe alternatives you've considered We have currently been using the span metrics processor at the otel collector end for the same purpose where the traces are being pushed and the metrics are consolidated using the processor.

But these metrics are being impacted by the trace sampling set on client. If we make tracing 100% then it can impact performance. Also most of the times the consolidation can easily be handled at the client end easily and further consolidation can take place at prometheus, like most of the applications run today.

Even if we are using a dummy Span (in case of sampled trace), I believe we can record metrics (I may be wrong here), thus allowing us to not push traces 100% of the times yet getting accurate metrics data.

This would further enhance user experience and allow them to record application metrics without much hassel.

Additional Context A lot of open source tools are adopting span metrics processor for showing R.E.D metrics along side their traces few examples include

Grafana Signoz Jaegar

anuragagarwal561994 avatar Aug 14 '22 14:08 anuragagarwal561994

sounds great

CoderPoet avatar Aug 15 '22 13:08 CoderPoet

I can contribute for the same, would need some guidance and proposal approval.

anuragagarwal561994 avatar Aug 15 '22 16:08 anuragagarwal561994

hi all, check out related discussion at https://github.com/open-telemetry/opentelemetry-java/pull/4260

trask avatar Aug 15 '22 20:08 trask

Hello again @trask, so I checked the discussion you referred, let me summarise the above discussion for others to follow, do let me know if I missed anything:

  1. https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/7030 this issue is same requirement but slightly different approach
  2. A PR was raised https://github.com/open-telemetry/opentelemetry-java/pull/4260 to add the extra annotations in the opentelemetry-java repo
  3. However the annotations at that time were being moved to another repo https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation-annotations
  4. The PR was closed because it did no longer belong in the opentelemetry-java repo https://github.com/open-telemetry/opentelemetry-java/pull/4260#issuecomment-1213297876

So if we can finalise what proposal works best for this feature, I suspect it to be a mix of both the approaches suggested by me and @fstab, this will give users more granular control. Both @fstab and I can contribute to this change.

anuragagarwal561994 avatar Aug 16 '22 00:08 anuragagarwal561994

@anuragagarwal561994 yes I agree with your summary 👍

trask avatar Aug 16 '22 01:08 trask

@trask what do you think should be the right approach here. First let's list down the requirements:

Behaviour Requirements:

Must Haves

  • We need to record metrics from the opentelemetry-java-instrumentation
  • User should be allowed to choose which methods to monitor
  • User to should allowed to choose to disable the functionality
  • These metrics should be exposed along with the open telemetry metrics exporter
  • Metrics should have span attributes as labels
  • Metrics control should be independent of tracing configuration / sampling at the agent's end
  • We should have metrics such as counter (call counts), gauge (concurrency), histogram (latency mainly)
  • Pod / Instance classification
  • Service classification
  • Metric name definition

Good to Have

  • User should be allowed to configure further granularities (this may only be applicable for histogram)
  • Histograms can further be configured to record on return values (but not sure how this can be simplified like request / response size)
  • User should be allowed to add a common attribute to these requests
  • Errors / Exceptions should be recorded as separate metrics

Let me know if I missed something or something else can be included here.

Approaches:

There are currently 2 approaches for recording metrics.

Single Annotation Approach We can use @WithSpan annotation and use the same to record histogram. Counter can be exposed separately or can be used from _count metric of histogram. Gauge can define the concurrency of the method.

Functionality to disable this feature can be defined in otel.properties.

Feature will be disabled by default.

Enabling it will start recording their metrics everywhere.

Further control can be given to the user using another annotation like @WithMetrics kind of annotation where user can enable / disable the metrics and further define finer grain controlled. If this annotation is missing and the feature is enabled we can assume that the user wants to record everything with the defaults.

Multiple Annotation Approach We can define explicitly different annotations like @Gauge, @Meter, @Counter, @WithMetrics. The last annotation is basically combination of the first three annotations with their defaults.

This is more or less as suggested by @fstab.

Again the feature control via otel.properties will remain same.

This approach will only record metrics where the user intends to.

We can further even try to find out a middle ground between the two approaches, making the life of users simpler.

Pod / Instance & Service classification can be handled by prometheus exporter I guess and should not be the responsibility of this feature.

Metric name definition can be done in the following ways:

  • We can either use span names by default and append _counter, _gauge or whatever the standards say
  • With annotations we can allow the user to explicitly define the metric names, although this would prefer the second approach as we have different annotations and it will be easier for the user to declare
  • We can use the method name, which I believe is anyways the name of the span so it might coincide with the first point

@trask let me know how this all sounds, what can we further think of and which is the preferred way the community prefers.

anuragagarwal561994 avatar Aug 16 '22 17:08 anuragagarwal561994

@trask did you get a chance to go through the above?

anuragagarwal561994 avatar Aug 18 '22 05:08 anuragagarwal561994

I am wondering why this isn't proposed earlier. Check what Datadog did with their Trace Metrics and how awesome it is. By the way, I agree that it would be nice if users are able to choose which methods to be monitored, but by default it should be all (enabled) or none(disabled). When enabled, users should be allowed to disable certain metrics (for example, health check metrics). Datadog Trace Metrics should provide a great example.

michaelzhd avatar Mar 08 '23 23:03 michaelzhd

hey @anuragagarwal561994, I think this SpanMetricsProcessors would make a good contribution to the contrib repo, i'm going to transfer it over there. It can then be used as an agent extension (or outside of the Java agent) as needed.

trask avatar Aug 24 '23 04:08 trask