
metricKeyToDimensions cache is printing many stack traces

Open greut opened this issue 3 years ago • 21 comments

Describe the bug

The LRU cache added in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6503 treats a missing key as an error and prints a large stack trace in the logs.

Steps to reproduce

Run with either high-cardinality dimensions or a small dimensions_cache_size.

What did you expect to see?

Either a metric for the cache hit ratio or a much lower log level, such as debug.

What did you see instead?

A lot of stack traces:

2022-04-04T03:10:26.171Z        error   [email protected]/processor.go:330   value not found in metricKeyToDimensions cache by key "copernic\x00getUserInfo\x00SPAN_KIND_SERVER\x00STATUS_CODE_UNSET"      {"kind": "processor", "name": "spanmetrics", "pipeline": "traces"}                                                                                     
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).collectCallMetrics                                                                                                                                                                                                                                                  
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:330                                                                                                                                                                                                                                                    
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).buildMetrics                                                                                                                                                                                                                                                        
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:258                                                                                                                                                                                                                                                    
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces                                                                                                                                                                                                                                                       
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:232                                                                                                                                                                                                                                                    
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1                                                                                                                                                                                                                                                                                             
        go.opentelemetry.io/[email protected]/processor/processorhelper/traces.go:70                                                                                                                                                                                                                                                                                         
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces                                                                                                                                                                                                                                                                                                       
        go.opentelemetry.io/[email protected]/consumer/traces.go:36                                                                                                                                                                                                                                                                                                          
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver.(*jReceiver).PostSpans                                                                                                                                                                                                                                                                     
        github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/trace_receiver.go:290                                                                                                                                                                                                                                                      
github.com/jaegertracing/jaeger/proto-gen/api_v2._CollectorService_PostSpans_Handler.func1                                                                                                                                                                                                                                                                                   
        github.com/jaegertracing/[email protected]/proto-gen/api_v2/collector.pb.go:208                                                                                                                                                                                                                                                                                         
go.opentelemetry.io/collector/config/configgrpc.enhanceWithClientInformation.func1                                                                                                                                                                                                                                                                                           
        go.opentelemetry.io/[email protected]/config/configgrpc/configgrpc.go:382                                                                                                     
google.golang.org/grpc.chainUnaryInterceptors.func1.1                                                                                                                                 
        google.golang.org/[email protected]/server.go:1116                                                                                                                                                                                                                                                                                                                        
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.UnaryServerInterceptor.func1                                                                              
        go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/[email protected]/interceptor.go:325                                                                                                                                                                                                                                                               
google.golang.org/grpc.chainUnaryInterceptors.func1.1                                                                                                                                 
        google.golang.org/[email protected]/server.go:1119                                                                                                                                 
google.golang.org/grpc.chainUnaryInterceptors.func1                                                                                                                                   
        google.golang.org/[email protected]/server.go:1121                                                                                                                                 
github.com/jaegertracing/jaeger/proto-gen/api_v2._CollectorService_PostSpans_Handler                                                                                                  
        github.com/jaegertracing/[email protected]/proto-gen/api_v2/collector.pb.go:210                                                                                                  
google.golang.org/grpc.(*Server).processUnaryRPC                                                                                                                                      
        google.golang.org/[email protected]/server.go:1282                                                                                                                                 
google.golang.org/grpc.(*Server).handleStream                                                                                                                                         
        google.golang.org/[email protected]/server.go:1619                                                                                                                                 
google.golang.org/grpc.(*Server).serveStreams.func1.2                                                                                                                                 
        google.golang.org/[email protected]/server.go:921  

What version did you use? Version: v0.48.0

What config did you use? Config: (e.g. the yaml config file)

Environment OS: Debian 11; Compiler (if manually compiled): vanilla Docker image

Additional context

:blue_heart:

greut avatar Apr 04 '22 11:04 greut

cc @albertteoh

jpkrohling avatar Apr 06 '22 12:04 jpkrohling

For posterity, please refer to this related discussion: https://cloud-native.slack.com/archives/C01N6P7KR6W/p1649330166040589.

albertteoh avatar Apr 13 '22 10:04 albertteoh

Also hit this issue when trying to use the Jaeger Monitor function in jaegertracing 1.33. @albertteoh, I cannot access the cloud-native Slack channel, could you paste the discussion here? Thanks.

gnozil avatar Apr 25 '22 15:04 gnozil

Hi all :wave: , I'd like to instrument an exporter component to emit a counter metric each time there is a cache miss and avoid logging such an event (context: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/9018). Wondering if someone could please point me to an existing example where the otel-collector emits metrics about itself for a particular component?

Related issue: https://github.com/open-telemetry/opentelemetry-collector/issues/4198

I think the proposal is to use otel-go to emit collector self-metrics, but the otel-go SDK is not quite ready for v1.x, which should contain the self-metrics feature; although it seems there has been steady progress towards this goal.
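For anyone looking for the same thing, here is a minimal sketch of what a cache-miss counter could look like using the otel-go metric API as it exists today; this is not the collector's actual self-observability wiring, and the meter name, counter name, and attribute are assumptions for illustration only:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// cacheMissCounter is a hypothetical self-observability counter; the names
// used here are illustrative, not actual collector metric names.
var cacheMissCounter metric.Int64Counter

func init() {
	meter := otel.Meter("spanmetricsprocessor")
	var err error
	cacheMissCounter, err = meter.Int64Counter("processor_spanmetrics_cache_misses")
	if err != nil {
		panic(err)
	}
}

// recordCacheMiss increments the counter instead of logging an error.
// The key itself is deliberately not recorded, to avoid high cardinality.
func recordCacheMiss(ctx context.Context) {
	cacheMissCounter.Add(ctx, 1,
		metric.WithAttributes(attribute.String("cache", "metricKeyToDimensions")))
}

func main() {
	// Without a configured MeterProvider this is a no-op, which is enough
	// to show the shape of the API.
	recordCacheMiss(context.Background())
}
```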

albertteoh avatar Apr 27 '22 10:04 albertteoh

I'm also experiencing a significant flood of these errors (at times > 600/sec). If these are harmless, can they be lowered to info or debug?

If I read the comment above correctly, there are some changes coming that would get rid of the code emitting these errors (right?), but if that is still a ways out, reducing the log level seems easy and quick enough. As it is, this is preventing me from using this feature (errors in our logs that aren't errors will just cause a lot of misunderstanding).

jvilhuber avatar May 10 '22 06:05 jvilhuber

I am seeing a bunch of these errors in our deployment as well; it would be good to know whether they are harmless. I am wondering if the spans that hit this error go unprocessed. Looking at that part of the code briefly, it seems like that may be the case (correct me if I am wrong).

arun-shopify avatar May 31 '22 14:05 arun-shopify

Yes, @arun-shopify, the collector instance stops processing traces, and we have to restart it to fix the error. It has become a blocker for migrating to OpenTelemetry, as it breaks a few times a day and needs a restart each time.

logamanig avatar Jun 02 '22 14:06 logamanig

This is not only a logging issue; the otel-collector instance stops working, resulting in trace loss. In our deployment, the otel-collector is used to extract trace metrics and store them in Prometheus, and this issue broke the forwarding of traces to the Jaeger collector.

gnozil avatar Jun 05 '22 15:06 gnozil

This is our experience as well. This causes the collector to stop sending the traces further in the service pipeline. The health check continues to return healthy, resulting in 'stuck' collectors.

This is much more than a log entry. @greut or @albertteoh, the issue doesn't indicate the severity of this bug. Can we modify this issue to say that it stops processing? Or should we create a new issue?

forestsword avatar Jun 15 '22 11:06 forestsword

we didn't experience any service disruption, only verbose logs.

greut avatar Jun 15 '22 12:06 greut

Agree, the problem is more serious than just a high volume of logging. It involves the loss of trace data.

I haven't invested the time to root-cause this properly yet, but I suspect the problem is that the spanmetrics processor creates the key for the metric dimensions, caches it, and then, quite reasonably, expects it to still exist in the cache so that it can build the metrics with those dimensions: processor.go#L233-L235.

However, because there's no lock between "setting" the key into cache and "getting" the key from cache, I think there's a potential race.

I suspect one issue is with the ContainsOrAdd method used here. Its documented comment reads:

// ContainsOrAdd checks if a key is in the cache without updating the
// recent-ness or deleting it for being stale, and if not, adds the value.
// Returns whether found and whether an eviction occurred.

I see two options so far:

  1. In cache(...), attempt to Get the key first (which updates recent-ness and should be a relatively cheap operation), and if it doesn't exist, then Add (sketched below). I'm not 100% confident that this resolves the root cause, but I feel it's at least the more correct thing to do regardless.

  2. Another option is to lock around the two lines of code so they run as one atomic step. This comes at the cost of more CPU time spent inside a mutex lock, but it should simplify the code, as it removes the need for a number of locks in that call path.
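To make option 1 concrete, here is a minimal standalone sketch, assuming the hashicorp/golang-lru cache; the cache function and the key/value shapes here are illustrative, not the processor's actual code:

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

func main() {
	// A deliberately small cache, similar to a small dimensions_cache_size.
	metricKeyToDimensions, err := lru.New(2)
	if err != nil {
		panic(err)
	}

	// Option 1: Get first (which refreshes recent-ness on a hit), and only
	// Add when the key is missing, instead of using ContainsOrAdd.
	cache := func(key string, dims map[string]string) {
		if _, ok := metricKeyToDimensions.Get(key); !ok {
			metricKeyToDimensions.Add(key, dims)
		}
	}

	key := "svc\x00op\x00SPAN_KIND_SERVER\x00STATUS_CODE_UNSET"
	cache(key, map[string]string{"service.name": "svc", "operation": "op"})

	// Later, when building metrics, the dimensions are expected to still be
	// in the cache; a miss here is what currently surfaces as the error log.
	if dims, ok := metricKeyToDimensions.Get(key); ok {
		fmt.Println("dimensions:", dims)
	} else {
		fmt.Println("value not found in metricKeyToDimensions cache by key", key)
	}
}
```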

Would be happy to get any feedback from folks about this or if anyone has looked at this problem at more depth than I have. 😄

albertteoh avatar Jun 15 '22 12:06 albertteoh

I agree with @albertteoh: it's the spanmetrics processor that's breaking the collector. Disabling spanmetrics is the only way to send traces without loss.

My environment:

Trace pipeline: OTLP receiver, spanmetrics and batch processors, Jaeger exporter

Metric pipeline: Prometheus receiver that collects the collector's own metrics, Prometheus exporter (it is also the exporter used by spanmetrics)

As mentioned by @forestsword, the collector's pod status is healthy even though it can't send traces. The only way to identify the problem is to check whether the trace backend has received any traces recently. With multiple replicas, traces are lost only for the affected pods. So, to reproduce this issue, I suggest using a single replica and checking the trace backend (Jaeger, in my case) to see whether traces are still being received after the error is logged.

logamanig avatar Jun 15 '22 13:06 logamanig

PR to address this issue: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/11149/

albertteoh avatar Jun 23 '22 10:06 albertteoh

We have the same issue; I have to restart the otel collector every day ...

amoscatelli avatar Jul 06 '22 07:07 amoscatelli

PR to address this issue: #11149

@albertteoh, when will this PR be resolved? Thanks.

gnozil avatar Jul 06 '22 08:07 gnozil

@albertteoh, when will this PR be resolved? Thanks.

@gnozil this was merged 2 days ago.

albertteoh avatar Jul 14 '22 12:07 albertteoh

@albertteoh I'm using v0.59.0 but still seeing this issue.

nguyenhoaibao avatar Sep 09 '22 07:09 nguyenhoaibao

@albertteoh I'm using v0.59.0 but still seeing this issue.

Are you able to share your otel collector config and the stack trace of the error please, @nguyenhoaibao?

albertteoh avatar Sep 09 '22 12:09 albertteoh

Hi @greut (+ others), just following up to see if you've had a chance to upgrade and whether this issue still exists as @nguyenhoaibao reported. I'd like to understand whether further investigation is required and, if so, whether you're able to provide your configs and the error stack trace again please.

albertteoh avatar Sep 16 '22 11:09 albertteoh

Since going from 0.53 to 0.59 10 days ago, that error is gone from our logs.

(screenshot: Discover - OpenSearch Dashboards, 2022-09-16)

EDIT btw thanks for the fixes!!

greut avatar Sep 16 '22 12:09 greut

A semi-workaround for us was to run with a 10k cache and to only process server and consumer spans using a filter (inspiration from https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/12615).

I can confirm it still fails on 0.59.0, but with the workaround it takes about a day to happen.

I've just removed my workarounds to reproduce it. There are no error logs, but I do see CPU spiking when this happens.

(screenshot: Grafana, 2022-09-20)

Once it hits its limit, it looks like it gets throttled pretty dramatically. Then, poof, nothing. Spans begin to be refused by the OTLP receiver and the accumulator stops logging anything. This line disappears afterwards:

2022-09-20T11:58:31.237Z  debug [email protected]/accumulator.go:288 metric expired: latency {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}

It continues to report healthy but refuses all incoming spans.

I've grabbed some pprof profiles; if you're interested, @albertteoh, I could pass them to you via Slack. Could this be an issue in the prometheusexporter itself?

forestsword avatar Sep 20 '22 14:09 forestsword

Experiencing the exact same behaviour as the post above. Ran into value not found in metricKeyToDimensions cache by key logs, upgraded to the 'fix' in v0.59.0, and the log messages disappeared. But it still stops processing incoming spans at a certain point.

Is it possible the fix just band-aided the log emission but didn't fix the root cause? I'm having to set up a job that restarts our deployment every day, or we lose all the spans that we want processed into spanmetrics.

What's worse is that it fails silently. Nothing in the logs; it just doesn't process the spans. Not great!

Bruno-DaSilva avatar Nov 12 '22 05:11 Bruno-DaSilva

We finally removed the otel-collector from our environment. Originally we assumed the trace metrics from the otel-collector would be useful for the Jaeger UI Monitor tab; however, it turned out that the Monitor tab in the Jaeger UI is much less useful: there is no filter function there, and the metrics from different environments are mixed together, which made the data meaningless. Once we stopped using the Jaeger UI Monitor function, there was no need to run the otel-collector anymore. After removing the otel-collector, we have not seen trace loss any more, so we suspect the original trace loss was also caused by the otel-collector.

gnozil avatar Nov 14 '22 02:11 gnozil

@Bruno-DaSilva @forestsword @greut the bug indeed hasn't been fixed yet; https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/11149 was just a speculative attempt at a fix, made without being able to reproduce the problem reliably, or perhaps a regression was introduced. The spanmetrics processor was still propagating the error out of the processor, so I'm not sure why the error wasn't being logged anymore in v0.59.0.

Fortunately, I have just found a way to reliably reproduce this error: reduce the configurable dimensions_cache_size down to a small number (e.g. 2). I've opened a PR to address this (with thanks to @Frapschen).
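For anyone wanting to see the mechanism outside the collector, a tiny standalone sketch (again assuming the hashicorp/golang-lru cache, not the processor's code) shows how a size-2 cache produces a miss as soon as more than two distinct keys are cached before the lookup:

```go
package main

import (
	"fmt"

	lru "github.com/hashicorp/golang-lru"
)

func main() {
	// Mirrors dimensions_cache_size: 2.
	c, err := lru.New(2)
	if err != nil {
		panic(err)
	}

	// Three distinct metric keys are cached within one batch of spans.
	keys := []string{"svc\x00opA", "svc\x00opB", "svc\x00opC"}
	for _, k := range keys {
		c.ContainsOrAdd(k, "dimensions for "+k)
	}

	// The oldest key has already been evicted, so the later lookup misses
	// even though the key was cached moments earlier.
	if _, ok := c.Get(keys[0]); !ok {
		fmt.Printf("value not found in cache by key %q\n", keys[0])
	}
}
```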

albertteoh avatar Nov 15 '22 12:11 albertteoh

Once we stopped using the Jaeger UI Monitor function, there was no need to run the otel-collector anymore

Same: otelcontrib instead of the jaeger-agent on the host, and then the jaeger-collector. Hoping for a bright future for the fully integrated jaeger-collector 2.

greut avatar Nov 15 '22 15:11 greut