feat (processor/k8sattributes): wait for initial sync when starting the k8sattributes processor.

Open h0cheung opened this issue 1 year ago • 1 comment

Description: <Describe what has changed.>

When starting the k8sattributes processor, block until the initial synchronization has completed. This solves #32556.
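A minimal sketch of the kind of blocking this relies on, using client-go's cache.WaitForCacheSync; the function name and package layout here are illustrative only, not the exact code in this PR:

```go
// Illustrative package name, not the processor's real layout.
package k8ssync

import (
	"context"
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// waitForInitialSync blocks until the informer reports that its first
// list/watch sync has completed, or until ctx is cancelled.
func waitForInitialSync(ctx context.Context, informer cache.SharedInformer) error {
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		return fmt.Errorf("k8s informer cache did not sync before the start-up deadline")
	}
	return nil
}
```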

Link to tracking Issue: <Issue number if applicable>

#32556

Testing: <Describe what testing was performed and which tests were added.>

Tested in a cluster with constant, high span traffic; after applying this PR there were no more spans missing associated k8s metadata.

Documentation: <Describe the documentation added.>

h0cheung avatar Apr 23 '24 07:04 h0cheung

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar May 10 '24 05:05 github-actions[bot]

@h0cheung Hi there, I was wondering if you have the time to finalize and merge the pull request. Looking forward to your update. Thanks!

orloffv avatar May 30 '24 14:05 orloffv

Thanks for your patience!

I have a suggestion about the two new options start_timeout and error_when_timeout.

The reason for this enhancement is to make sure the k8sattributes processor does not start consuming incoming traces or metrics until the initial synchronization has completed; otherwise, some traces/metrics/logs may not be associated with k8s metadata.

But I think some users may care more about how quickly the collector starts than about the completeness of the k8s metadata.

So how about we change these two fields to:

  • force_wait_for_cache_synced (bool, default: false)
  • wait_for_synced_timeout (time.Duration)

Users who care more about completeness could set force_wait_for_cache_synced to true; if the sync has still not completed after wait_for_synced_timeout, the collector should return an error directly instead of running, since completeness cannot be guaranteed.
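For illustration, a hedged sketch of how these two fields could look in the processor's Go Config struct; the package name, field names, and layout below simply mirror this proposal and are not actual merged code:

```go
// Hypothetical package name, for illustration only.
package k8sattributesprocessor

import "time"

// Config sketch for the two proposed fields.
type Config struct {
	// When true, Start blocks until the initial k8s metadata sync completes.
	ForceWaitForCacheSynced bool `mapstructure:"force_wait_for_cache_synced"`

	// Upper bound on how long Start may block; on expiry the collector
	// returns an error rather than running with incomplete metadata.
	WaitForSyncedTimeout time.Duration `mapstructure:"wait_for_synced_timeout"`
}
```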

@h0cheung @TylerHelmuth

I agree. What are your thoughts? @TylerHelmuth

h0cheung avatar Jun 11 '24 03:06 h0cheung

@TylerHelmuth could you take a look, please?

orloffv avatar Jun 24 '24 16:06 orloffv

@fatsheep9146 I like that idea. I suggest adjusting the naming to wait_for_metadata and wait_for_metadata_timeout to hide some of the details from the end user. I believe we already use the term metadata to refer to the pod data that is gathered and held.

TylerHelmuth avatar Jul 02 '24 17:07 TylerHelmuth

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Jul 17 '24 05:07 github-actions[bot]

> This PR was marked stale due to lack of activity. It will be closed in 14 days.

remove stale

h0cheung avatar Jul 25 '24 11:07 h0cheung

> @fatsheep9146 I like that idea. I suggest adjusting the naming to wait_for_metadata and wait_for_metadata_timeout to hide some of the details from the end user. I believe we already use the term metadata to refer to the pod data that is gathered and held.

Thanks. These two configuration items look simple and clear. I think we can use them.
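A minimal sketch of how a Start hook could honor the renamed options; all identifiers below (processorSketch, podInformer, the field names) are assumptions for illustration, not the final implementation:

```go
// Hypothetical package name, for illustration only.
package k8sattributesprocessor

import (
	"context"
	"errors"
	"time"

	"k8s.io/client-go/tools/cache"
)

type processorSketch struct {
	waitForMetadata        bool
	waitForMetadataTimeout time.Duration
	podInformer            cache.SharedInformer
}

// start runs the informer, optionally blocks on the first metadata sync,
// and fails start-up if the sync does not finish within the timeout.
func (p *processorSketch) start(ctx context.Context) error {
	go p.podInformer.Run(ctx.Done())

	if !p.waitForMetadata {
		return nil // start immediately; metadata may be incomplete at first
	}

	timeoutCtx, cancel := context.WithTimeout(ctx, p.waitForMetadataTimeout)
	defer cancel()
	if !cache.WaitForCacheSync(timeoutCtx.Done(), p.podInformer.HasSynced) {
		return errors.New("timed out waiting for initial k8s metadata sync")
	}
	return nil
}
```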

h0cheung avatar Jul 25 '24 11:07 h0cheung

@h0cheung Hi, could you take some time to finish up the code so the pull request can be merged? I’d really appreciate your update on this. Thanks a lot

valner avatar Aug 07 '24 14:08 valner

@fatsheep9146, @TylerHelmuth please take a look

orloffv avatar Aug 15 '24 10:08 orloffv

@h0cheung sorry for the delayed response, could you resolve the conflicts?

fatsheep9146 avatar Aug 18 '24 23:08 fatsheep9146

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions[bot] avatar Sep 02 '24 05:09 github-actions[bot]

@h0cheung Hi there! I was wondering if you could possibly take a little more time to merge the feature? I’d really appreciate it!

valner avatar Sep 04 '24 08:09 valner

Hello @h0cheung,

We wanted to thank you for your work on this pull request. It is very important to us because the related bug significantly affects our work. If you need any assistance or information to speed up the process, please let us know. We appreciate your efforts and look forward to its completion.

Thank you!

orloffv avatar Sep 18 '24 09:09 orloffv

Updated, and skipped the lifecycle tests because they fail without KUBERNETES_SERVICE_HOST.

How do the prometheus-compliance-tests work? Is their failure caused by this PR?

h0cheung avatar Sep 19 '24 06:09 h0cheung

It seems the prometheus-compliance-tests are flaky on main, as mentioned in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/35119

valner avatar Sep 19 '24 07:09 valner

It seems the flaky tests were fixed, as mentioned in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/35119

Therefore, @h0cheung could you please rebase one more time?

valner avatar Sep 23 '24 08:09 valner

@TylerHelmuth pls take a look

orloffv avatar Sep 23 '24 11:09 orloffv

@h0cheung As I can see in the tests, we need to run go mod tidy to fix the build. Could you please do it when you have time?

valner avatar Sep 27 '24 07:09 valner

@h0cheung Hi! It would be wonderful if you have time to run go mod tidy and push the changes to the branch.

valner avatar Oct 10 '24 10:10 valner

@TylerHelmuth pls take a look

orloffv avatar Oct 15 '24 17:10 orloffv