opentelemetry-operator
opentelemetry-operator copied to clipboard
Fix the potential issue of duplicate targets
Description: <Describe what has changed.>
There may be a problem of unexpected target duplication -- for example, users rewrite the address during relabeling.
Link to tracking Issue(s): <Issue number if applicable>
- Resolves: #3617
Testing: <Describe what testing was performed and which tests were added.>
add TestDiscovery_NewItem unit test function and adjust the relabel_test.go file to accommodate the new test
@swiatekm Hello, I've submitted this pull request. Could you please review it at your convenience? Thank you!
This approach seems wasteful, as you're applying the relabeling step twice. It also circumvents current component responsibility boundaries, as we'd like all relabeling to happen in the prehook.
Instead, I think it'd be much simpler to defer calculating the target url. We only need the url during target allocation, at which point relabeling is guaranteed to be done. What I'd do:
- Make the
TargetURLattribute private and start empty. - Add a
GetTargetURLmethod that returns the attribute value if set, otherwise checks the label and sets it. - Since this attribute is serialized into JSON by the web server, you might need to explicitly set a tag on it.
The general processing flow of "Targets" is as follows: Discovery -> NewItem -> SetTargets -> prehook -> assign
My considerations are as follows:
- If deduplication is not performed before assigning the target, it may still lead to duplicate targets being scraped. consistent-hashing can ensure that targets are assigned to the same OTel instance by hashing the
TargetURLafter relabeling to avoid duplication, but least-weighted cannot ensure assignment to the same OTel instance. - The prehook is not currently required to be enabled, so in the above PR, all relabel operations have been moved to
NewItem, and theRelabeledKeepfield information is saved for relabelConfigTargetFilter. - After deduplication, the
TargetURLused byconsistent-hashingcould have remained unchanged. However, I later found that the targets returned by Prometheus are unordered (The targets are obtained by allGroups traversing map), which means the target selected after deduplication may vary each time. Therefore, I added theTargetURLRelabeledfield forconsistent-hashingto perform hashing. This ensures that even if different targets are selected, they will still be assigned to the same OTel instance.
This approach seems wasteful, as you're applying the relabeling step twice. It also circumvents current component responsibility boundaries, as we'd like all relabeling to happen in the prehook.
Instead, I think it'd be much simpler to defer calculating the target url. We only need the url during target allocation, at which point relabeling is guaranteed to be done. What I'd do:
- Make the
TargetURLattribute private and start empty.- Add a
GetTargetURLmethod that returns the attribute value if set, otherwise checks the label and sets it.- Since this attribute is serialized into JSON by the web server, you might need to explicitly set a tag on it.
@swiatekm Modifying the TargetURL alone cannot solve the potential issue of least-weighted strategy scraping duplicate targets. Additionally, since prometheusreceiver will perform relabeling again, the TargetURL provided to OTel must be the original TargetURL (i.e., original __address__).
However, the above PR does circumvent the component responsibility boundaries, and placing the relabel operation in the prehook is indeed more reasonable. Since the deduplication operation is very similar to the target discarding operation, I think the deduplication operation can be placed in the default filter (i.e., relabelConfigTargetFilter) first. Afterward, we may need to consider whether multiple prehooks are necessary or if there should be a default prehook. I'd love to hear your thoughts on this.
Allright, I see what the problem is. What we need to defer here is not just determining the target url, but also the hash. Is that right @mike9421?
If that's the case, could you hold off on this change for a bit. I have a WIP change that accomplishes the same thing, albeit for different reasons. It is much simpler than what you have here, so I'll submit it and we can see if it addresses your issue. Does that sound ok?
Allright, I see what the problem is. What we need to defer here is not just determining the target url, but also the hash. Is that right @mike9421?
If that's the case, could you hold off on this change for a bit. I have a WIP change that accomplishes the same thing, albeit for different reasons. It is much simpler than what you have here, so I'll submit it and we can see if it addresses your issue. Does that sound ok?
Alright, this PR will be put on hold until your WIP is completed. Which PR corresponds to the WIP? I'd like to see if it resolves the issue.
https://github.com/open-telemetry/opentelemetry-operator/pull/3777 is the PR. It doesn't resolve the issue, but it will make it much simpler to do so. I think just calculating the Hash and TargetUrl forcefully in the relabel prehook should solve it.
I think just calculating the Hash and TargetUrl forcefully in the relabel prehook should solve it.
Yes, a forced relabel operation is required. As I mentioned above, since the prehook is not mandatory (in the current design), the following issue may need to be considered:
Afterward, we may need to consider whether multiple prehooks are necessary or if there should be a default prehook.
Yes, a forced relabel operation is required. As I mentioned above, since the prehook is not mandatory (in the current design), the following issue may need to be considered:
Yes, we might need to force it, or at least document that target duplication is possible when it's disabled. Right now it's enabled by default, so it wouldn't be too drastic of a change at least.
More broadly, we do want to look into doing all target relabeling in the target allocator (right now we just drop targets, but don't change the labels themselves). This is more complicated, but it would make us more compatible with Prometheus and resolve these duplication issues.
The updated PR enables passing the required hash value and enhances extensibility, allowing different deduplication strategies through the use of different hash calculation methods for different FilterStrategy. For now, I havenβt found a better approach to pass the relabeled data to func (t *Item) Hash() ItemHash.
@swiatekm @jaronoff97 Hi, could you please help review this PR when you have time and check if any updates are needed?
This PR:
- Updates the scrape target labels to align with Prometheus-processed labels
- Removes relabel_configs
- Adds unit tests comparing the behavior with Prometheus
Itβs a bit challenging to fully decouple these unit tests from Prometheus. Thanks in advance for your review! π
@swiatekm @jaronoff97 This issue is rendering Target Allocator unusable for us as well, could you please review this MR or propose a workaround in the meantime?
E2E Test Results
β34 filesβ Β±0ββ223 suitesβ +2βββ1h 59m 49s β±οΈ + 1m 51s β89 tests +1βββ89 β +1ββ0 π€ Β±0ββ0 β Β±0β 227 runsβ +2ββ227 β +2ββ0 π€ Β±0ββ0 β Β±0β
Results for commit 42078f20.βΒ± Comparison against base commit e5c048b4.
:recycle: This comment has been updated with latest results.
I think this should work. Do we have any existing end to end tests that we can update or view to verify this is working as expected? Thank you for your patience π
Yes, the PR includes unit tests for the changes. I'll be fixing the linting issues and also adding some end-to-end tests to verify this over the next couple of days. Thank you!
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: mike9421 / name: mike9421 (642b92d1b1f0354d31a456e314b23be79bd82f4d, b2659c9b01a1cc7bd505a2b4d7347b6d77c6c43e, 09d28a4986ba87da4d5b2a7d173eaceac59a4054, b715b5683a434e160102d6db72cfcfec423e6e41, d7d629213856b98ca4fcf4db063e12345a7113f0, 61eb412005629dc1560829fb51c8fe051da55f1b, 125489cd60b6684f61f0207acf42d53200152e3f, 6b40a04f029bd998c53b4f66188ac4c382599b4a, 42078f2053d9d97394abd1c734994d0aeaaec2fe, 6617817799c5adea8a09cf3dec28ac1d14ca44e5, 3eff16be7f1835a25f31a14e87f96a6dc0030b88)
- :white_check_mark: login: jaronoff97 / name: Jacob Aronoff (37a78cb32ede2fb1c356cbf4e03257aaed092e6e)
Hi @jaronoff97 @swiatekm , could you please help review this PR when you have time? Here's a summary of the changes:
-
Fixed golint issues.
-
Removed incorrect assert statements.
-
Added end-to-end (e2e) tests.
Hi @jaronoff97 @swiatekm , could you please help review this PR when you have time? Here's a summary of the changes:
- Fixed golint issues.
- Removed incorrect assert statements.
- Added end-to-end (e2e) tests.
@jaronoff97 @swiatekm This issue affects our Target Allocator usage too. Could you please review this MR or propose a workaround?