opentelemetry-operator

Bug: Target Allocator: Allocates targets to an otel-collector pod in "Pending" state

Open sfc-gh-akrishnan opened this issue 2 years ago • 3 comments

How to reproduce:

  • Set the OpenTelemetry Collector replicas to 2 (or more)
  • Use PodAntiAffinity, or request an unavailable resource (CPU / memory), so that not all of the pods are schedulable
  • Although some pods are still unschedulable, the target allocator allocates endpoints to them

Shouldn't we wait for the pod to be scheduled before we allocate target endpoints to it?

sfc-gh-akrishnan avatar Oct 06 '23 20:10 sfc-gh-akrishnan

This is clear if the Pod in question is new, but much less clear if it's an existing Pod being rescheduled. Reassigning targets is a fairly expensive operation for the collectors themselves, as it flushes scrape caches, so we should avoid doing so carelessly. Maybe we should have a configurable timeout for existing Pods, so it's possible to control how long the allocator waits for a Pod before reassigning its targets?

swiatekm avatar Oct 15 '23 14:10 swiatekm

#2528 didn't fix this, it just made the fix easier to implement. The main problem here is that it isn't clear to me what the behaviour should be like. I think the following Pods getting assigned targets works:

  • Pods which are Ready
  • Pods which were Ready less than X seconds ago, and are now not Ready, but also not Terminating

But I haven't completely thought this through. If anyone can think of any nasty edge cases for this problem, please speak up and let me know.

swiatekm avatar May 01 '24 14:05 swiatekm

sorry that was the auto-closer 😓

jaronoff97 avatar May 01 '24 14:05 jaronoff97

This is now fixed with https://github.com/open-telemetry/opentelemetry-operator/issues/3781 and https://github.com/open-telemetry/opentelemetry-operator/issues/3989.

swiatekm avatar May 25 '25 15:05 swiatekm