
Metrics-generator should filter out spans based upon attributes

Open kvrhdn opened this issue 2 years ago • 5 comments

Is your feature request related to a problem? Please describe. If the metrics-generator is enabled for a tenant, it will generate metrics for all traces that are ingested. In some cases it would be nice if we could only generate metrics for a subset of spans.

Example:

  • don't collect metrics for traces that don't have the attribute http.code
  • only generate metrics for spans from service XXX

Describe the solution you'd like It should be possible to configure filters that include/exclude spans based upon their attributes.

Describe alternatives you've considered There is no alternative AFAIK. You could drop the metrics when remote writing them, but that is definitely not ideal.

Additional context

kvrhdn avatar Jun 09 '22 17:06 kvrhdn

What do you think of having filters a bit similar to the OpenTelemetry Collector but just for the metrics processors?

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/internal/coreinternal/processor/filterspan/filterspan.go

It would allow multiple include or exclude filters that can be defined using a regexp or strict match type.

metrics_generator:
  processor:
    span_metrics:
      dimensions:
        - http.method
        - http.host
      span_filters:
        include:
          - match_type: regexp
            attributes:
              - key: http.method
                value: "GET|POST|PUT|DELETE|HEAD|OPTIONS|TRACE"
        exclude:
          - match_type: strict
            attributes:
              - key: http.host
                value: "some-host-name"
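To make the proposed semantics concrete, here is a minimal sketch in Python (not Tempo code; the function names and data shapes are illustrative). The assumption is that a span is kept when it matches at least one include policy and no exclude policy, and that every attribute listed in a policy must match:

```python
import re

def attr_matches(policy, attrs):
    """Return True if every attribute condition in the policy matches the span."""
    for cond in policy["attributes"]:
        value = attrs.get(cond["key"])
        if value is None:
            return False
        if policy["match_type"] == "strict":
            if value != cond["value"]:
                return False
        else:  # regexp: the whole value must match the pattern
            if not re.fullmatch(cond["value"], value):
                return False
    return True

def keep_span(filters, attrs):
    """Apply include policies first, then exclude policies."""
    include = filters.get("include", [])
    exclude = filters.get("exclude", [])
    if include and not any(attr_matches(p, attrs) for p in include):
        return False
    return not any(attr_matches(p, attrs) for p in exclude)

# Mirrors the configuration example above.
filters = {
    "include": [{"match_type": "regexp",
                 "attributes": [{"key": "http.method",
                                 "value": "GET|POST|PUT|DELETE|HEAD|OPTIONS|TRACE"}]}],
    "exclude": [{"match_type": "strict",
                 "attributes": [{"key": "http.host",
                                 "value": "some-host-name"}]}],
}

print(keep_span(filters, {"http.method": "GET", "http.host": "api"}))             # True
print(keep_span(filters, {"http.method": "GET", "http.host": "some-host-name"}))  # False
print(keep_span(filters, {"http.method": "CONNECT", "http.host": "api"}))         # False
```

Whether a multi-attribute policy should be AND or OR, and whether regexp matches should be anchored, would need to be pinned down in the actual implementation.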

faustodavid avatar Jul 15 '22 15:07 faustodavid

Yeah, that looks like a good reference! We won't be able to vendor the OTel processor directly, but we can mirror the structure of the configuration.

For a first implementation I'd like to keep it as simple as possible (just focus on the minimal features required to make this useful) so our initial implementation might be more limited.

kvrhdn avatar Jul 18 '22 10:07 kvrhdn

Does this help decrease the count of series produced by the metrics generator and therefore does it have cost management applications? Or is it more about just getting metrics about the services or cases you actually care about without having those numbers skewed by the other ones?

09jvilla avatar Nov 04 '22 18:11 09jvilla

If the filters are applied in a way that reduces cardinality in the metrics produced by the metrics-generator, then there could potentially be a cost reduction.

ie-pham avatar Nov 04 '22 19:11 ie-pham

Based on your response, though, it sounds like that is not the core use case or purpose of doing this? It's more like a nice secondary benefit?

09jvilla avatar Nov 04 '22 19:11 09jvilla

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

github-actions[bot] avatar Jan 04 '23 00:01 github-actions[bot]

This filtering is both to declutter metrics that won't actually be used, but also a cost savings for us.

As we look to broaden our OTel usage, including Faro, the metrics generator is required to really harness the value, but it will put extreme pressure on active series counts. This is the biggest impact for us as a Grafana Cloud customer. The previously proposed solution would more than suffice for my particular use case.

NickAdolf avatar Jan 25 '23 11:01 NickAdolf

I think there is a benefit to filtering out the traces before they get sent to the metrics generator, but that might be out of the scope of this issue.

I have a use-case where we receive ~5TB of trace data per day, but only want to generate metrics for a specific list of services. It would be great if the distributors could simply not send that data to the metrics generator, reducing the daily network costs in addition to the CPU/memory consumption of the metrics generator.

This could probably be done in my use-case by splitting Tempo tenants, but then I'd have to push that logic down to the OTel collector and deal with different services using different tenants even though they are part of the same stack whose telemetry data is cross-referenced.

It would be nice if the distributor could use a limited set of dimensions to filter with, even if it were only the tenant and an exact match on the service.name attribute.
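The coarse distributor-side filter described above could be as simple as a per-tenant allow-list of service names, checked before spans are forwarded to the metrics-generator. A minimal sketch (the config shape and function names are hypothetical, not Tempo's):

```python
# Hypothetical per-tenant allow-list: only spans whose resource-level
# service.name is listed get forwarded to the metrics-generator.
ALLOWED_SERVICES = {
    "tenant-a": {"checkout", "payments"},
    "tenant-b": {"frontend"},
}

def forward_to_generator(tenant, service_name):
    """Exact-match filter; tenants without a configured list forward everything."""
    allowed = ALLOWED_SERVICES.get(tenant)
    if allowed is None:
        return True
    return service_name in allowed

print(forward_to_generator("tenant-a", "checkout"))  # True
print(forward_to_generator("tenant-a", "frontend"))  # False
print(forward_to_generator("tenant-c", "anything"))  # True
```

Because this runs per batch in the distributor, an exact-match set lookup keeps the hot-path cost negligible compared to regexp evaluation.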

lmickh avatar Feb 09 '23 18:02 lmickh

I've got a PR up that approaches solving this issue. Review/testing welcome. I'd like to get this running somewhere and put it through its paces, but based on the tests I have a reasonable indication that it is doing what we want here.

To the notion of filtering at the distributor: I think this is a great idea, but it is more complicated due to how the distributor works. The best approach might be to filter at the head using the agent or a similar approach, so the distributor never receives the spans. But perhaps you want to filter out only the metrics generation and not the actual ingestion of the spans. It might be worth opening a separate issue to discuss how we could approach that.

It came up during review with the team that it might also be nice to filter based on TraceQL, given the amount of effort we're putting into the language. If we could tie into that area of the code, we might be able to reuse some of the filtering capabilities there, but I suspect this will be a follow-up issue.
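For illustration only, a TraceQL-based filter might be expressed as a list of queries that a span must match before metrics are generated for it. This configuration is purely hypothetical; neither the key name nor the final syntax has been settled:

```yaml
metrics_generator:
  processor:
    span_metrics:
      # Hypothetical: generate metrics only for spans matching these TraceQL expressions.
      traceql_filters:
        - '{ resource.service.name = "XXX" }'
        - '{ span.http.method =~ "GET|POST" }'
```

Reusing the TraceQL engine would keep filtering semantics consistent between querying and metrics generation, at the cost of evaluating queries on the ingest path.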

zalegrala avatar Apr 07 '23 16:04 zalegrala

Note, any testers will want to read the docs that are also included in #2274.

zalegrala avatar Apr 07 '23 16:04 zalegrala