
Re-entrant apiserver calls are not part of the same trace

Open · dashpole opened this issue 3 years ago · 23 comments

Part of enhancement: kubernetes/enhancements#647

This is a continuation of the discussion in https://github.com/kubernetes/kubernetes/pull/94942#discussion_r657114027 about the risks of abuse that might be possible. To summarize, the otelhttp.WithPublicEndpoint() option causes spans generated by the API server handler to start a new trace (with a link to the incoming trace context), rather than be children of the incoming request.
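
For reference, a minimal, self-contained sketch of how otelhttp.WithPublicEndpoint() is typically applied to an HTTP handler (not the actual apiserver wiring; the operation name and listen address are placeholders):

```go
package main

import (
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		_, _ = w.Write([]byte("ok"))
	})

	// WithPublicEndpoint causes each incoming request to start a new
	// trace; any span context in the incoming traceparent header is
	// recorded as a link rather than becoming the parent.
	handler := otelhttp.NewHandler(inner, "KubernetesAPI",
		otelhttp.WithPublicEndpoint(),
	)

	log.Fatal(http.ListenAndServe(":8080", handler))
}
```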

Note: I think there is a bug in OpenTelemetry in which parent contexts are still used for sampling decisions: https://github.com/open-telemetry/opentelemetry-go/issues/2031. I'll make sure that is fixed separately.

However, one of the use cases outlined in the KEP is tracing re-entrant API calls. Using WithPublicEndpoint makes re-entrant calls difficult to visualize, because each one is now its own trace. We should probably keep WithPublicEndpoint as the default behavior, but expose an option to make the endpoint non-public. First, though, we should discuss and document the abuse implications for an actor with the ability to send requests to the Kubernetes API with a span context. The implications discussed so far are:

  • If the actor has access to the trace ID of another trace, they can append spans to it. This wouldn't alter any existing spans in that trace, but it would add the actor's span to it.
  • The actor can set sampled=true in their request to cause the apiserver to record spans for all of their requests. They could potentially overwhelm a tracing backend, or cause other spans to be dropped. (A sketch of such a request follows this list.)
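
A minimal sketch of the second vector, assuming a client that can reach the apiserver; the URL is a placeholder and the trace/span IDs are the W3C Trace Context spec examples:

```go
package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Any client can attach a W3C traceparent header to a request it is
	// authorized to send. The trailing "01" flag marks the context as
	// sampled; the trace ID and span ID below are the W3C spec examples.
	req, err := http.NewRequest(http.MethodGet,
		"https://kubernetes.default.svc/api/v1/pods", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("traceparent",
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(req.Header.Get("traceparent"))
}
```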

One unanswered question from the thread, from @liggitt:

@lavalamp, you've thought about the abuse vectors more w.r.t. priority and fairness... do you have thoughts on this?

dashpole avatar Jun 25 '21 15:06 dashpole

/sig instrumentation
/sig apimachinery

cc @logicalhan @lilic

dashpole avatar Jun 25 '21 15:06 dashpole

@dashpole: The label(s) sig/apimachinery cannot be applied, because the repository doesn't have them.

In response to this:

/sig instrumentation
/sig apimachinery

cc @logicalhan @lilic

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 25 '21 15:06 k8s-ci-robot

/sig api-machinery

dashpole avatar Jun 25 '21 15:06 dashpole

/assign

logicalhan avatar Jun 25 '21 16:06 logicalhan

We should never be adding to user-controlled traces, IMO.

The apiserver should add baggage to its trace headers. Re-entrant calls should repeat the baggage. The second apiserver can then use this to verify that the trace originated from some apiserver in the cluster; in that case, it should append to the trace. Otherwise, it should start a new one. This should not be an option for the user. There are multiple ways of constructing such a baggage value.
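
A minimal sketch of one possible construction, using OpenTelemetry baggage; the baggage key, HMAC scheme, and shared secret are all hypothetical, since the comment deliberately leaves the construction open:

```go
package tracingauth

import (
	"context"
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"

	"go.opentelemetry.io/otel/baggage"
)

// Hypothetical baggage key and cluster-shared secret.
const originKey = "k8s.apiserver.trace-origin"

var clusterSecret = []byte("replace-with-cluster-shared-secret")

// tagOutgoing marks a context so that a re-entrant call can be recognized
// as apiserver-originated. Here the value is an HMAC over the trace ID,
// but any verifiable construction would do.
func tagOutgoing(ctx context.Context, traceID string) (context.Context, error) {
	mac := hmac.New(sha256.New, clusterSecret)
	mac.Write([]byte(traceID))
	member, err := baggage.NewMember(originKey, hex.EncodeToString(mac.Sum(nil)))
	if err != nil {
		return ctx, err
	}
	bag, err := baggage.New(member)
	if err != nil {
		return ctx, err
	}
	return baggage.ContextWithBaggage(ctx, bag), nil
}

// isApiserverOriginated checks the incoming baggage; only if it verifies
// should the server continue the incoming trace rather than start a new one.
func isApiserverOriginated(ctx context.Context, traceID string) bool {
	val := baggage.FromContext(ctx).Member(originKey).Value()
	if val == "" {
		return false
	}
	mac := hmac.New(sha256.New, clusterSecret)
	mac.Write([]byte(traceID))
	return hmac.Equal([]byte(val), []byte(hex.EncodeToString(mac.Sum(nil))))
}
```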

If you have a strong use case for a user starting a trace, let me know, but I don't think it's going to outweigh the anti-abuse perspective.

lavalamp avatar Jun 25 '21 17:06 lavalamp

If you have a strong use case for a user starting a trace, let me know, but I don't think it's going to outweigh the anti-abuse perspective.

I'd like to be able to run kubectl apply -f blah.yaml --trace, but I am indifferent as to whether that attaches a trace directly or runs through some webhook which validates who I am and then attaches a trace to the request.

logicalhan avatar Jun 25 '21 17:06 logicalhan

I don't see how webhooks could attach a trace.

Letting users request that the apiserver initiate a trace is different from attaching a request to an already-created trace, no?

If we let users request extra resource usage (trace allocation), then IMO we have to authz that somehow.

lavalamp avatar Jun 25 '21 17:06 lavalamp

Maybe I'm phrasing this incorrectly. We could have some mechanism by which kubectl either attaches a trace header directly, or attaches some sort of proxy pseudo-trace header based on which the apiserver, after authz-ing it (via a webhook or whatever), decides whether to actually attach a trace to the request.

logicalhan avatar Jun 25 '21 17:06 logicalhan

Maybe a preexisting trace makes sense if folks are making API calls.

I guess my concern is just bounding the resource usage. A non-authz solution might look like a cap on the percentage of requests that get traced, with user-requested traces counting against that cap (and ignored if they would put it over the cap).
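
A minimal sketch of such a cap, written as an OpenTelemetry sampler that wraps an inner sampler and enforces a per-second budget; the name and limit are placeholders:

```go
package cappedsampler

import (
	"sync"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// cappedSampler delegates to an inner sampler but never allows more than
// maxPerSecond sampled spans per second. Spans sampled because of an
// incoming sampled=true context count against the same budget.
type cappedSampler struct {
	inner        sdktrace.Sampler
	maxPerSecond int

	mu     sync.Mutex
	window time.Time
	count  int
}

func New(inner sdktrace.Sampler, maxPerSecond int) sdktrace.Sampler {
	return &cappedSampler{inner: inner, maxPerSecond: maxPerSecond}
}

func (s *cappedSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	res := s.inner.ShouldSample(p)
	if res.Decision != sdktrace.RecordAndSample {
		return res
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	if now.Sub(s.window) >= time.Second {
		s.window, s.count = now, 0
	}
	if s.count >= s.maxPerSecond {
		// Over the cap: ignore the sampling request and drop the span.
		res.Decision = sdktrace.Drop
		return res
	}
	s.count++
	return res
}

func (s *cappedSampler) Description() string {
	return "CappedSampler(" + s.inner.Description() + ")"
}
```

Wrapping a parent-based sampler this way would mean an incoming sampled=true flag still counts against the same budget, which matches the "ignored if it would put it over the cap" behavior described above.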

I don't think it works this way, but I definitely don't want apiserver dialing out to some user-controlled address to report on a trace.


lavalamp avatar Jun 25 '21 17:06 lavalamp

I don't think it works this way, but I definitely don't want apiserver dialing out to some user-controlled address to report on a trace.

Yeah that's not really what I was saying.

I guess my concern is just bounding the resource usage. A non-authz solution might look like a cap on the percentage of requests that get traced, with user-requested traces counting against that cap (and ignored if they would put it over the cap).

You would probably require separate quotas; trace sampling is generally super low (think 1 in a hundred thousand requests). But yeah, I'm a fan of quotas in general.

logicalhan avatar Jun 25 '21 17:06 logicalhan

/triage accepted

fedebongio avatar Jun 29 '21 20:06 fedebongio

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Sep 27 '21 20:09 k8s-triage-robot

/remove-lifecycle stale

dashpole avatar Oct 04 '21 14:10 dashpole

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 02 '22 15:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 01 '22 15:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-triage-robot avatar Mar 03 '22 15:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 03 '22 15:03 k8s-ci-robot

/reopen
/remove-lifecycle rotten

dashpole avatar Mar 03 '22 16:03 dashpole

@dashpole: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 03 '22 16:03 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 01 '22 16:06 k8s-triage-robot

/remove-lifecycle stale

dashpole avatar Jun 01 '22 17:06 dashpole

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 30 '22 17:08 k8s-triage-robot

/remove-lifecycle stale

logicalhan avatar Sep 22 '22 16:09 logicalhan

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 21 '22 17:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 20 '23 18:01 k8s-triage-robot

/remove-lifecycle rotten

dashpole avatar Jul 31 '23 18:07 dashpole

cc @Richabanker

dashpole avatar Jul 31 '23 18:07 dashpole

Would an acceptable solution be to have an option for internal-api-endpoints? If I have a network boundary around my apiserver, then I trust incoming traceparent headers and would propagate them.

Including that as a command-line argument or as part of the TracingConfiguration seems reasonable. If someone suggests a path that would be accepted, I'll happily take on implementing it.
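
A rough sketch of where such an option could plug in, assuming a hypothetical trustIncomingTraceContext setting (neither the flag nor the TracingConfiguration field exists today):

```go
package tracingfilter

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// wrapHandler sketches the option being discussed: a hypothetical
// trustIncomingTraceContext setting that controls whether the server
// treats itself as a public endpoint.
func wrapHandler(inner http.Handler, trustIncomingTraceContext bool) http.Handler {
	opts := []otelhttp.Option{}
	if !trustIncomingTraceContext {
		// Default, current behavior: every request starts a new trace,
		// with a link to any incoming span context.
		opts = append(opts, otelhttp.WithPublicEndpoint())
	}
	// With trustIncomingTraceContext=true, the incoming traceparent is
	// honored, so re-entrant apiserver calls join the caller's trace.
	return otelhttp.NewHandler(inner, "KubernetesAPI", opts...)
}
```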

mterhar avatar Aug 03 '23 20:08 mterhar

That is a reasonable approach. We need to think about:

  • Should this exist only in the apiserver config, or is it applicable to all Kubernetes components and therefore belong in component-base?
  • Do we need per-HTTP-server configuration, or is a single option enough? IIRC, more than one HTTP server was instrumented for apiserver tracing.

dashpole avatar Aug 03 '23 20:08 dashpole

A related use case I have: I want to debug some issues that appear in tests when my webhooks are invoked. They return a 200 response, but the API server is failing and returning an EOF context timeout.

To debug this, I would like the trace to start when the test makes the call to the API server. For this to be useful, I would expect the API server to attach the trace context to outgoing requests.
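
For context, a generic sketch of the propagation mechanism (not the apiserver's actual webhook client wiring): otelhttp.NewTransport injects the active trace context from the request context into outgoing calls.

```go
package webhookclient

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// newClient wraps the default transport so outgoing requests carry the
// traceparent header for the span active in their context.
func newClient() *http.Client {
	return &http.Client{
		Transport: otelhttp.NewTransport(http.DefaultTransport),
	}
}

// callWebhook issues a request whose traceparent reflects the span in ctx,
// so the webhook's own spans can join the same trace.
func callWebhook(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return nil, err
	}
	return newClient().Do(req)
}
```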

If I could enable this cluster-wide for testing/debugging purposes, that would be great.

dprotaso avatar Oct 13 '23 20:10 dprotaso