target-allocator doesn't check if CRDs exist before querying them
Component(s)
target allocator
What happened?
Description
We have a case where one of our clients runs the target-allocator on a cluster where not all of the monitoring.coreos.com CRDs are installed. They only want to monitor ServiceMonitor and PodMonitor resources, not Probe and ScrapeConfig resources, so they didn't install those CRDs on the cluster.
What happens is that the target-allocator keeps trying to query the non-existent CRDs and throws errors:
{"level":"error","ts":"2025-04-14T05:41:00Z","msg":"Unhandled Error","logger":"UnhandledError","error":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1alpha1.ScrapeConfig: failed to list *v1alpha1.ScrapeConfig: the server could not find the requested resource (get scrapeconfigs.monitoring.coreos.com)","stacktrace":"k8s.io/client-go/tools/cache.DefaultWatchErrorHandler\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:158\nk8s.io/client-go/tools/cache.(*Reflector).Run.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:308\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/client-go/tools/cache.(*Reflector).Run\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:306\nk8s.io/client-go/tools/cache.(*controller).Run.(*Group).StartWithChannel.func2\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:55\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72"}
{"level":"info","ts":"2025-04-14T05:41:38Z","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1.Probe: the server could not find the requested resource (get probes.monitoring.coreos.com)"}
{"level":"error","ts":"2025-04-14T05:41:38Z","msg":"Unhandled Error","logger":"UnhandledError","error":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1.Probe: failed to list *v1.Probe: the server could not find the requested resource (get probes.monitoring.coreos.com)","stacktrace":"k8s.io/client-go/tools/cache.DefaultWatchErrorHandler\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:158\nk8s.io/client-go/tools/cache.(*Reflector).Run.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:308\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/client-go/tools/cache.(*Reflector).Run\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:306\nk8s.io/client-go/tools/cache.(*controller).Run.(*Group).StartWithChannel.func2\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:55\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72"}
{"level":"info","ts":"2025-04-14T05:41:49Z","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1alpha1.ScrapeConfig: the server could not find the requested resource (get scrapeconfigs.monitoring.coreos.com)"}
We're using a static config for the target-allocator passed via a ConfigMap; we don't use the OpenTelemetryCollector CRD for it. While researching the issue, I thought: maybe you just can't configure Probe and ScrapeConfig via the YAML config yet – I haven't seen it in the readme. So what I tried next was:
- creating the `Probe` and `ScrapeConfig` CRDs in my test cluster, so they can be monitored by the target-allocator
- disabling `PodMonitorSelector` and `ServiceMonitorSelector`, which CAN be configured in the static config
- deleting the `PodMonitor` and `ServiceMonitor` CRDs from the cluster
And I got the same error as the one above, this time about the PodMonitor and ServiceMonitor CRDs that don't exist, even after I disabled their selectors.
Which means that the target-allocator:
- queries all 4 CRDs, no matter how you configure it
- doesn't check whether the CRDs exist, blindly relying on the assumption that they do
I don't think this is how it should work, so I'd like to propose checking whether the CRDs exist before querying them. If they don't exist, print a WARN message and gracefully skip monitoring the resources whose CRDs are missing (see the sketch below).
If this idea gets a positive reaction, I can even try to implement the fix myself.
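For illustration, here is a minimal sketch of what such a pre-flight check could look like using the Kubernetes discovery API. The helper and its wiring are hypothetical, not actual target-allocator code:

```go
// crd_check.go – an illustrative sketch only, not the actual target-allocator code.
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

// resourceServed reports whether a resource (e.g. "scrapeconfigs") is served under a given
// groupVersion (e.g. "monitoring.coreos.com/v1alpha1"), i.e. whether its CRD is installed.
func resourceServed(dc discovery.DiscoveryInterface, groupVersion, resource string) (bool, error) {
	list, err := dc.ServerResourcesForGroupVersion(groupVersion)
	if err != nil {
		// A NotFound error here normally means the whole group/version is absent,
		// which is what happens when the CRD was never installed.
		if apierrors.IsNotFound(err) {
			return false, nil
		}
		return false, err
	}
	for _, r := range list.APIResources {
		if r.Name == resource {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}
	ok, err := resourceServed(dc, "monitoring.coreos.com/v1alpha1", "scrapeconfigs")
	if err != nil {
		panic(err)
	}
	if !ok {
		// Proposed behaviour: warn once and skip starting the ScrapeConfig watcher
		// instead of letting its reflector fail and log on every retry.
		fmt.Println("WARN: scrapeconfigs.monitoring.coreos.com not found, skipping ScrapeConfig discovery")
	}
}
```

The same check could be run once per watched resource type at startup, before its informer is created.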
Steps to Reproduce
- disabling `PodMonitorSelector` or `ServiceMonitorSelector` or any other selector that can be configured in the static config
- deleting the `PodMonitor` or `ServiceMonitor` or any other CRD from the cluster
Expected Result
The target-allocator doesn't query the CRD of a resource whose selector is disabled.
Actual Result
The target-allocator queries the CRD of a resource whose selector is disabled, and throws errors when that CRD doesn't exist in the cluster.
Kubernetes Version
1.30.5
Operator version
ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:v0.124.0
Collector version
ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.124.1
Environment information
Environment
this issue is relevant for all environments
Log output
{"level":"error","ts":"2025-04-14T05:41:00Z","msg":"Unhandled Error","logger":"UnhandledError","error":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1alpha1.ScrapeConfig: failed to list *v1alpha1.ScrapeConfig: the server could not find the requested resource (get scrapeconfigs.monitoring.coreos.com)","stacktrace":"k8s.io/client-go/tools/cache.DefaultWatchErrorHandler\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:158\nk8s.io/client-go/tools/cache.(*Reflector).Run.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:308\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/client-go/tools/cache.(*Reflector).Run\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:306\nk8s.io/client-go/tools/cache.(*controller).Run.(*Group).StartWithChannel.func2\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:55\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72"}
{"level":"info","ts":"2025-04-14T05:41:38Z","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1.Probe: the server could not find the requested resource (get probes.monitoring.coreos.com)"}
{"level":"error","ts":"2025-04-14T05:41:38Z","msg":"Unhandled Error","logger":"UnhandledError","error":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: Failed to watch *v1.Probe: failed to list *v1.Probe: the server could not find the requested resource (get probes.monitoring.coreos.com)","stacktrace":"k8s.io/client-go/tools/cache.DefaultWatchErrorHandler\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:158\nk8s.io/client-go/tools/cache.(*Reflector).Run.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:308\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:226\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/backoff.go:227\nk8s.io/client-go/tools/cache.(*Reflector).Run\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:306\nk8s.io/client-go/tools/cache.(*controller).Run.(*Group).StartWithChannel.func2\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:55\nk8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1\n\t/home/runner/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:72"}
{"level":"info","ts":"2025-04-14T05:41:49Z","msg":"pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:243: failed to list *v1alpha1.ScrapeConfig: the server could not find the requested resource (get scrapeconfigs.monitoring.coreos.com)"}
Additional context
No response
That does sound like something we should fix. I recall trying to make it work the way you describe when adding support for Probes and ScrapeConfigs, but I don't think we have any tests for this.
For clarity, does the Target Allocator still work, and the problem is just the error logs, or is it completely nonfunctional?
@cazorla19 if you'd like to submit a fix, I'll be happy to review it.
+1, we're also seeing this issue with a very similar use case where the Probe and ScrapeConfig CRDs don't exist.
The issue is just the error logs, which for now we've had to mute (at the cost of missing any actual errors). Functionality-wise, it all still works.
If the CRD isn't present in the cluster, we shouldn't emit any errors about it. I'm not sure how easy this is going to be in practice, as we use prometheus-operator's machinery for generating scrape configs from CRs, and it quite reasonably assumes all of its CRDs are present.
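One possible stopgap for the log noise – a sketch only, assuming client-go v0.32's `WatchErrorHandler` signature and that the informers are accessible where they're built, not actual target-allocator code – is to install a custom watch error handler that downgrades missing-CRD errors instead of letting the default handler report them on every retry:

```go
// watch_error_handler.go – a hypothetical helper, not actual target-allocator code.
package allocatorwatch

import (
	"log"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/tools/cache"
)

// quietMissingCRD returns a watch error handler that logs missing-CRD errors ("the server
// could not find the requested resource") at warn level and hands everything else to the
// default handler, which is what currently produces the repeated "Unhandled Error" entries.
func quietMissingCRD() cache.WatchErrorHandler {
	return func(r *cache.Reflector, err error) {
		// The reflector wraps the list/watch error, but a 404 caused by a missing CRD
		// should still surface as a NotFound status error.
		if apierrors.IsNotFound(err) {
			log.Printf("WARN: watched resource is not installed in this cluster: %v", err)
			return
		}
		cache.DefaultWatchErrorHandler(r, err)
	}
}

// It would be wired up on each informer before starting it, roughly like:
//
//	_ = informer.SetWatchErrorHandler(quietMissingCRD())
```

That only hides the symptom, though; skipping the watcher entirely (as in the discovery check sketched above) would also avoid the pointless list/watch retries.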
@swiatekm this idea behind prometheus-operator (i.e. assuming the CRDs are present instead of adding extra logic to check for their existence in the cluster) sounds logical – however, I'm still wondering why we need to query the CRDs of resources that we've disabled. That's one part of this issue: you may disable scraping of a particular resource, but the target-allocator still keeps trying to query its CRD (see the illustrative sketch below).
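Purely to illustrate that point – the config struct and field names below are hypothetical stand-ins, not the real target-allocator configuration – registering a watcher only when its selector is actually configured would mean a disabled resource's CRD is never listed or watched at all:

```go
// selector_gating.go – purely illustrative; the config struct and field names are
// hypothetical stand-ins, not the real target-allocator configuration.
package allocatorconfig

import (
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// prometheusCRConfig is a stand-in for whatever the static config decodes into.
// In this sketch, a nil selector means "this resource type is disabled".
type prometheusCRConfig struct {
	ServiceMonitorSelector *metav1.LabelSelector
	PodMonitorSelector     *metav1.LabelSelector
	ProbeSelector          *metav1.LabelSelector
	ScrapeConfigSelector   *metav1.LabelSelector
}

// startEnabledWatchers registers a watcher only for resource types whose selector is set,
// so a disabled resource's CRD is never queried.
func startEnabledWatchers(cfg prometheusCRConfig, start func(resource string)) {
	watchers := []struct {
		resource string
		selector *metav1.LabelSelector
	}{
		{"servicemonitors", cfg.ServiceMonitorSelector},
		{"podmonitors", cfg.PodMonitorSelector},
		{"probes", cfg.ProbeSelector},
		{"scrapeconfigs", cfg.ScrapeConfigSelector},
	}
	for _, w := range watchers {
		if w.selector == nil {
			log.Printf("WARN: %s watching is disabled, skipping", w.resource)
			continue
		}
		start(w.resource)
	}
}
```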
The use case I'm trying to tackle is not a common one, rather a niche one:
- the user hasn't deployed all of prom-operator's CRDs to the cluster, only the few they explicitly selected
- this could happen for a few reasons, such as internal corporate security guidelines or simply a wish not to deploy CRDs that won't be used
I understand it might not be easy to handle such a use case, especially when it's tied so closely to the dependency rather than to the project itself, so I'm wondering how bad it would be to leave this issue unresolved.