opentelemetry-operator icon indicating copy to clipboard operation
opentelemetry-operator copied to clipboard

Fix the potential issue of duplicate targets

Open mike9421 opened this issue 9 months ago β€’ 9 comments

Description: <Describe what has changed.>

There may be a problem of unexpected target duplication -- for example, users rewrite the address during relabeling.

Link to tracking Issue(s): <Issue number if applicable>

  • Resolves: #3617

Testing: <Describe what testing was performed and which tests were added.> add TestDiscovery_NewItem unit test function and adjust the relabel_test.go file to accommodate the new test

mike9421 avatar Feb 23 '25 16:02 mike9421

@swiatekm Hello, I've submitted this pull request. Could you please review it at your convenience? Thank you!

mike9421 avatar Feb 23 '25 16:02 mike9421

This approach seems wasteful, as you're applying the relabeling step twice. It also circumvents current component responsibility boundaries, as we'd like all relabeling to happen in the prehook.

Instead, I think it'd be much simpler to defer calculating the target url. We only need the url during target allocation, at which point relabeling is guaranteed to be done. What I'd do:

  1. Make the TargetURL attribute private and start empty.
  2. Add a GetTargetURL method that returns the attribute value if set, otherwise checks the label and sets it.
  3. Since this attribute is serialized into JSON by the web server, you might need to explicitly set a tag on it.

swiatekm avatar Feb 24 '25 11:02 swiatekm

The general processing flow of "Targets" is as follows: Discovery -> NewItem -> SetTargets -> prehook -> assign

My considerations are as follows:

  • If deduplication is not performed before assigning the target, it may still lead to duplicate targets being scraped. consistent-hashing can ensure that targets are assigned to the same OTel instance by hashing the TargetURL after relabeling to avoid duplication, but least-weighted cannot ensure assignment to the same OTel instance.
  • The prehook is not currently required to be enabled, so in the above PR, all relabel operations have been moved to NewItem, and the RelabeledKeep field information is saved for relabelConfigTargetFilter.
  • After deduplication, the TargetURL used by consistent-hashing could have remained unchanged. However, I later found that the targets returned by Prometheus are unordered (The targets are obtained by allGroups traversing map), which means the target selected after deduplication may vary each time. Therefore, I added the TargetURLRelabeled field for consistent-hashing to perform hashing. This ensures that even if different targets are selected, they will still be assigned to the same OTel instance.

This approach seems wasteful, as you're applying the relabeling step twice. It also circumvents current component responsibility boundaries, as we'd like all relabeling to happen in the prehook.

Instead, I think it'd be much simpler to defer calculating the target url. We only need the url during target allocation, at which point relabeling is guaranteed to be done. What I'd do:

  1. Make the TargetURL attribute private and start empty.
  2. Add a GetTargetURL method that returns the attribute value if set, otherwise checks the label and sets it.
  3. Since this attribute is serialized into JSON by the web server, you might need to explicitly set a tag on it.

@swiatekm Modifying the TargetURL alone cannot solve the potential issue of least-weighted strategy scraping duplicate targets. Additionally, since prometheusreceiver will perform relabeling again, the TargetURL provided to OTel must be the original TargetURL (i.e., original __address__).

However, the above PR does circumvent the component responsibility boundaries, and placing the relabel operation in the prehook is indeed more reasonable. Since the deduplication operation is very similar to the target discarding operation, I think the deduplication operation can be placed in the default filter (i.e., relabelConfigTargetFilter) first. Afterward, we may need to consider whether multiple prehooks are necessary or if there should be a default prehook. I'd love to hear your thoughts on this.

mike9421 avatar Mar 02 '25 15:03 mike9421

Allright, I see what the problem is. What we need to defer here is not just determining the target url, but also the hash. Is that right @mike9421?

If that's the case, could you hold off on this change for a bit. I have a WIP change that accomplishes the same thing, albeit for different reasons. It is much simpler than what you have here, so I'll submit it and we can see if it addresses your issue. Does that sound ok?

swiatekm avatar Mar 03 '25 12:03 swiatekm

Allright, I see what the problem is. What we need to defer here is not just determining the target url, but also the hash. Is that right @mike9421?

If that's the case, could you hold off on this change for a bit. I have a WIP change that accomplishes the same thing, albeit for different reasons. It is much simpler than what you have here, so I'll submit it and we can see if it addresses your issue. Does that sound ok?

Alright, this PR will be put on hold until your WIP is completed. Which PR corresponds to the WIP? I'd like to see if it resolves the issue.

mike9421 avatar Mar 06 '25 15:03 mike9421

https://github.com/open-telemetry/opentelemetry-operator/pull/3777 is the PR. It doesn't resolve the issue, but it will make it much simpler to do so. I think just calculating the Hash and TargetUrl forcefully in the relabel prehook should solve it.

swiatekm avatar Mar 06 '25 15:03 swiatekm

I think just calculating the Hash and TargetUrl forcefully in the relabel prehook should solve it.

Yes, a forced relabel operation is required. As I mentioned above, since the prehook is not mandatory (in the current design), the following issue may need to be considered:

Afterward, we may need to consider whether multiple prehooks are necessary or if there should be a default prehook.

mike9421 avatar Mar 08 '25 03:03 mike9421

Yes, a forced relabel operation is required. As I mentioned above, since the prehook is not mandatory (in the current design), the following issue may need to be considered:

Yes, we might need to force it, or at least document that target duplication is possible when it's disabled. Right now it's enabled by default, so it wouldn't be too drastic of a change at least.

More broadly, we do want to look into doing all target relabeling in the target allocator (right now we just drop targets, but don't change the labels themselves). This is more complicated, but it would make us more compatible with Prometheus and resolve these duplication issues.

swiatekm avatar Mar 08 '25 15:03 swiatekm

The updated PR enables passing the required hash value and enhances extensibility, allowing different deduplication strategies through the use of different hash calculation methods for different FilterStrategy. For now, I haven’t found a better approach to pass the relabeled data to func (t *Item) Hash() ItemHash.

mike9421 avatar Apr 14 '25 17:04 mike9421

@swiatekm @jaronoff97 Hi, could you please help review this PR when you have time and check if any updates are needed?

This PR:

  • Updates the scrape target labels to align with Prometheus-processed labels
  • Removes relabel_configs
  • Adds unit tests comparing the behavior with Prometheus

It’s a bit challenging to fully decouple these unit tests from Prometheus. Thanks in advance for your review! πŸ™

mike9421 avatar Aug 02 '25 20:08 mike9421

@swiatekm @jaronoff97 This issue is rendering Target Allocator unusable for us as well, could you please review this MR or propose a workaround in the meantime?

eenchev avatar Sep 09 '25 19:09 eenchev

E2E Test Results

 34 files  Β±0  223 suites  +2   1h 59m 49s ⏱️ + 1m 51s  89 tests +1   89 βœ… +1  0 πŸ’€ Β±0  0 ❌ Β±0  227 runsβ€Š +2  227 βœ… +2  0 πŸ’€ Β±0  0 ❌ Β±0 

Results for commit 42078f20. ± Comparison against base commit e5c048b4.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Sep 09 '25 20:09 github-actions[bot]

I think this should work. Do we have any existing end to end tests that we can update or view to verify this is working as expected? Thank you for your patience πŸ™‡

Yes, the PR includes unit tests for the changes. I'll be fixing the linting issues and also adding some end-to-end tests to verify this over the next couple of days. Thank you!

mike9421 avatar Sep 11 '25 09:09 mike9421

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: mike9421 / name: mike9421 (642b92d1b1f0354d31a456e314b23be79bd82f4d, b2659c9b01a1cc7bd505a2b4d7347b6d77c6c43e, 09d28a4986ba87da4d5b2a7d173eaceac59a4054, b715b5683a434e160102d6db72cfcfec423e6e41, d7d629213856b98ca4fcf4db063e12345a7113f0, 61eb412005629dc1560829fb51c8fe051da55f1b, 125489cd60b6684f61f0207acf42d53200152e3f, 6b40a04f029bd998c53b4f66188ac4c382599b4a, 42078f2053d9d97394abd1c734994d0aeaaec2fe, 6617817799c5adea8a09cf3dec28ac1d14ca44e5, 3eff16be7f1835a25f31a14e87f96a6dc0030b88)
  • :white_check_mark: login: jaronoff97 / name: Jacob Aronoff (37a78cb32ede2fb1c356cbf4e03257aaed092e6e)

Hi @jaronoff97 @swiatekm , could you please help review this PR when you have time? Here's a summary of the changes:

  • Fixed golint issues.

  • Removed incorrect assert statements.

  • Added end-to-end (e2e) tests.

mike9421 avatar Sep 15 '25 19:09 mike9421

Hi @jaronoff97 @swiatekm , could you please help review this PR when you have time? Here's a summary of the changes:

  • Fixed golint issues.
  • Removed incorrect assert statements.
  • Added end-to-end (e2e) tests.

@jaronoff97 @swiatekm This issue affects our Target Allocator usage too. Could you please review this MR or propose a workaround?

pandoralink avatar Sep 22 '25 02:09 pandoralink