feat (processor/k8sattributes): wait for initial sync when starting the k8sattributes processor.

Open h0cheung opened this issue 1 year ago • 1 comment

Description: <Describe what has changed.>

When starting the k8sattributes processor, block until the initial synchronization has completed. This solves #32556.
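A minimal sketch of the kind of blocking this relies on, using client-go's cache.WaitForCacheSync; the function name and package layout here are illustrative only, not the exact code in this PR:

```go
// Illustrative package name, not the processor's real layout.
package k8ssync

import (
	"context"
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// waitForInitialSync blocks until the informer reports that its first
// list/watch sync has completed, or until ctx is cancelled.
func waitForInitialSync(ctx context.Context, informer cache.SharedInformer) error {
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		return fmt.Errorf("k8s informer cache did not sync before the start-up deadline")
	}
	return nil
}
```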

Link to tracking Issue: <Issue number if applicable>

#32556

Testing: <Describe what testing was performed and which tests were added.>

Tested in a cluster with constant, high span traffic; after applying this PR there were no more spans missing associated k8s metadata.

Documentation: <Describe the documentation added.>

h0cheung avatar Apr 23 '24 07:04 h0cheung

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar May 10 '24 05:05 github-actions[bot]

@h0cheung Hi there, I was wondering if you have the time to finalize and merge the pull request. Looking forward to your update. Thanks!

orloffv avatar May 30 '24 14:05 orloffv

Thanks for your patience!

I have a suggestion about the two new options start_timeout and error_when_timeout.

The reason for this enhancement is to make sure the k8sattributes processor does not start consuming incoming traces or metrics until the initial synchronization has completed; otherwise, some traces/metrics/logs may not be associated with k8s metadata.

But I think some users may care more about how quickly the collector starts than about the completeness of the k8s metadata.

So how about we change these two fields to:

  • force_wait_for_cache_synced (bool, default: false)
  • wait_for_synced_timeout (time.Duration)

Users who care more about completeness could set force_wait_for_cache_synced to true; if the sync has still not completed after wait_for_synced_timeout, the collector should return an error directly instead of running, since completeness cannot be guaranteed.
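For illustration, a hedged sketch of how these two fields could look in the processor's Go Config struct; the package name, field names, and layout below simply mirror this proposal and are not actual merged code:

```go
// Hypothetical package name, for illustration only.
package k8sattributesprocessor

import "time"

// Config sketch for the two proposed fields.
type Config struct {
	// When true, Start blocks until the initial k8s metadata sync completes.
	ForceWaitForCacheSynced bool `mapstructure:"force_wait_for_cache_synced"`

	// Upper bound on how long Start may block; on expiry the collector
	// returns an error rather than running with incomplete metadata.
	WaitForSyncedTimeout time.Duration `mapstructure:"wait_for_synced_timeout"`
}
```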

@h0cheung @TylerHelmuth

I agree. What are your thoughts? @TylerHelmuth

h0cheung avatar Jun 11 '24 03:06 h0cheung

@TylerHelmuth could you take a look, please?

orloffv avatar Jun 24 '24 16:06 orloffv

@fatsheep9146 I like that idea. I suggest adjusting the naming to wait_for_metadata and wait_for_metadata_timeout to hide some of the details from the end user. I believe we already use the term metadata to refer to the pod data that is gathered and held.

TylerHelmuth avatar Jul 02 '24 17:07 TylerHelmuth

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jul 17 '24 05:07 github-actions[bot]

> This PR was marked stale due to lack of activity. It will be closed in 14 days.

remove stale

h0cheung avatar Jul 25 '24 11:07 h0cheung

> @fatsheep9146 I like that idea. I suggest adjusting the naming to wait_for_metadata and wait_for_metadata_timeout to hide some of the details from the end user. I believe we already use the term metadata to refer to the pod data that is gathered and held.

Thanks. These two configuration items look simple and clear. I think we can use them.
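A minimal sketch of how a Start hook could honor the renamed options; all identifiers below (processorSketch, podInformer, the field names) are assumptions for illustration, not the final implementation:

```go
// Hypothetical package name, for illustration only.
package k8sattributesprocessor

import (
	"context"
	"errors"
	"time"

	"k8s.io/client-go/tools/cache"
)

type processorSketch struct {
	waitForMetadata        bool
	waitForMetadataTimeout time.Duration
	podInformer            cache.SharedInformer
}

// start runs the informer, optionally blocks on the first metadata sync,
// and fails start-up if the sync does not finish within the timeout.
func (p *processorSketch) start(ctx context.Context) error {
	go p.podInformer.Run(ctx.Done())

	if !p.waitForMetadata {
		return nil // start immediately; metadata may be incomplete at first
	}

	timeoutCtx, cancel := context.WithTimeout(ctx, p.waitForMetadataTimeout)
	defer cancel()
	if !cache.WaitForCacheSync(timeoutCtx.Done(), p.podInformer.HasSynced) {
		return errors.New("timed out waiting for initial k8s metadata sync")
	}
	return nil
}
```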

h0cheung avatar Jul 25 '24 11:07 h0cheung

@h0cheung Hi, could you take some time to finish up the code so the pull request can be merged? I’d really appreciate your update on this. Thanks a lot

valner avatar Aug 07 '24 14:08 valner

@fatsheep9146, @TylerHelmuth please take a look

orloffv avatar Aug 15 '24 10:08 orloffv

@h0cheung sorry for the delayed response, could you resolve the conflicts?

fatsheep9146 avatar Aug 18 '24 23:08 fatsheep9146

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Sep 02 '24 05:09 github-actions[bot]

@h0cheung Hi there! I was wondering if you could possibly take a little more time to merge the feature? I’d really appreciate it!

valner avatar Sep 04 '24 08:09 valner

Hello @h0cheung,

We wanted to thank you for your work on this pull request. It is very important to us because the related bug significantly affects our work. If you need any assistance or information to speed up the process, please let us know. We appreciate your efforts and look forward to its completion.

Thank you!

orloffv avatar Sep 18 '24 09:09 orloffv

Updated, and skipped the lifecycle tests because they fail without KUBERNETES_SERVICE_HOST.

How do the prometheus-compliance-tests work? Is their failure caused by this PR?

h0cheung avatar Sep 19 '24 06:09 h0cheung

It seems the prometheus-compliance-tests are flaky on main, as mentioned in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/35119

valner avatar Sep 19 '24 07:09 valner

It seems the flaky tests were fixed, as mentioned in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/35119

Therefore, @h0cheung could you please rebase one more time?

valner avatar Sep 23 '24 08:09 valner

@TylerHelmuth pls take a look

orloffv avatar Sep 23 '24 11:09 orloffv

@h0cheung As I can see in the tests, we need to run go mod tidy to fix the build. Could you please do it when you have time?

valner avatar Sep 27 '24 07:09 valner

@h0cheung Hi! It would be wonderful if you have time to run go mod tidy and push the changes to the branch.

valner avatar Oct 10 '24 10:10 valner

@TylerHelmuth pls take a look

orloffv avatar Oct 15 '24 17:10 orloffv