Add Distributed Tracing to Spire Agent and Server APIs
As an operator, I'm interested in adding support for distributed tracing in the spire-agent and spire-server APIs. This would help me observe interactions between the agent and the server, and also between various clients and workloads of spire as a system. I make extensive use of tracing in my work, and support for tracing in spire would greatly help integrate spire as a first class cloud-native citizen in my work.
I'm suggesting adding tracing to all inter-process API surfaces.
Advantages gained in doing so are multiple:
- Powerful tool for performance work - wherever a profiler would be used in a single-process world, distributed tracing can be used in a distributed system.
- Easy to follow trace of interactions across components, compared to having to piece back the story of an interaction using individual metrics and log lines.
- Integration with external clients that already have tracing in place: it would be awesome to be able to trace through spire when debugging production issues of applications higher up the stack.
- Powerful debugging tool - distributed tracing helps identify the root cause very quickly when brown-outs or outages occur. This would be especially helpful in cases where Spire ends up in the request path of an application, e.g. if a JWT SVID is fetched synchronously as part of an RPC call between two applications serving a user request.
Concerns:
- Code related to tracing would be added at every client and server site, making these code paths more encumbered with observability-related ceremony. This makes the business logic less straightforward to read.
- Tracing needs careful propagation of ctx and the tracing context in an application. This means a large code surface area could be impacted by this change if ctx values aren't being handled properly everywhere. I don't believe this is the case in the latest APIs.
- Tracing vendors and solutions use various wire protocols. A useful implementation would have to support at least Zipkin (for Jaeger and Zipkin), and other common vendors such as LightStep, OpenCensus, Datadog.
- There are some cases where RPC calls between components have a less well-defined request-response boundary (such as long-lived streaming calls to fetch X.509 SVIDs), which would require more careful design of the spans.
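To make the ctx propagation concern concrete, here is a minimal sketch using only the Go standard library. It is not SPIRE code and not any tracing library's API: the span type, startSpan, fetchSVID, and the placeholder trace IDs are all hypothetical names made up for illustration. It shows that a call which receives the caller's ctx joins the caller's trace, while a call that is handed a fresh context.Background() silently starts an unrelated one.

```go
package main

import (
	"context"
	"fmt"
)

// span is a hypothetical, minimal stand-in for a tracing span.
type span struct {
	TraceID string // shared by every span in the same trace
	Name    string
	Parent  string // name of the parent span, empty for a root span
}

// spanKey is the private context key under which the current span is stored.
type spanKey struct{}

// startSpan creates a child of the span found in ctx, or a new root span
// when ctx carries none. It returns a derived ctx holding the new span.
func startSpan(ctx context.Context, name string) (context.Context, *span) {
	s := &span{Name: name}
	if parent, ok := ctx.Value(spanKey{}).(*span); ok {
		s.TraceID = parent.TraceID
		s.Parent = parent.Name
	} else {
		// Placeholder ID; a real tracer would generate a random one.
		s.TraceID = "trace-for-" + name
	}
	return context.WithValue(ctx, spanKey{}, s), s
}

// fetchSVID stands in for any downstream call that should be traced.
func fetchSVID(ctx context.Context) *span {
	_, s := startSpan(ctx, "FetchSVID")
	return s
}

// demo runs the same downstream call twice: once with the caller's ctx,
// once with a fresh background context that drops the tracing state.
func demo() (root, linked, orphan *span) {
	ctx, root := startSpan(context.Background(), "HandleRequest")
	linked = fetchSVID(ctx)
	orphan = fetchSVID(context.Background())
	return root, linked, orphan
}

func main() {
	root, linked, orphan := demo()
	fmt.Println("linked shares trace:", linked.TraceID == root.TraceID, "parent:", linked.Parent)
	fmt.Println("orphan shares trace:", orphan.TraceID == root.TraceID, "parent:", orphan.Parent)
}
```

The orphaned span is the failure mode to look for when auditing code paths: nothing errors out, the trace is just cut in two.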
I initially brought up the idea on the Spiffe Slack in #spire.
A few thoughts:
First, I'm definitely interested in getting tracing in SPIRE. We have downstream (in our registration API clients) and upstream (in our UpstreamAuthority) tracing, so having that in SPIRE would be valuable to me.
There's prior art in "bringing your own log sink" with a custom main function. I would think doing something similar to that would be the best approach for any less-common tracing vendors. I'm less eager to provide commercial vendor support by default, but Zipkin and Jaeger seem like good options.
gRPC has tracing interceptors available - we should install those, along with something wrapping the DB. Add our own tracing around plugin calls, and I think we get reasonable coverage without too much work.
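As a rough illustration of the wrapping idea, the sketch below times a call and records the result without touching the wrapped function's body. It uses only the standard library: the handler type is a local stand-in shaped like a gRPC unary handler rather than the real grpc type, and recorder, withSpan, and the span names are hypothetical. A real implementation would hand the record to a tracing library instead of an in-memory slice.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// handler is a local stand-in shaped like a gRPC unary handler.
type handler func(ctx context.Context, req any) (any, error)

// record is what the hypothetical recorder keeps for each finished span.
type record struct {
	Name     string
	Duration time.Duration
	Err      error
}

// recorder collects finished spans in memory; safe for concurrent use.
type recorder struct {
	mu      sync.Mutex
	records []record
}

func (r *recorder) add(rec record) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.records = append(r.records, rec)
}

// withSpan wraps h so that every call is timed and recorded under name.
// The wrapped handler's response and error are returned unchanged.
func withSpan(r *recorder, name string, h handler) handler {
	return func(ctx context.Context, req any) (any, error) {
		start := time.Now()
		resp, err := h(ctx, req)
		r.add(record{Name: name, Duration: time.Since(start), Err: err})
		return resp, err
	}
}

// demo wraps one succeeding and one failing call and returns the records.
func demo() []record {
	rec := &recorder{}
	ctx := context.Background()

	// Illustrative span names only; not actual SPIRE identifiers.
	attest := withSpan(rec, "plugin/NodeAttestor", func(ctx context.Context, req any) (any, error) {
		return "ok", nil
	})
	fetch := withSpan(rec, "datastore/FetchBundle", func(ctx context.Context, req any) (any, error) {
		return nil, errors.New("db unavailable")
	})

	attest(ctx, nil)
	fetch(ctx, nil)
	return rec.records
}

func main() {
	for _, r := range demo() {
		fmt.Printf("%s took %v, err=%v\n", r.Name, r.Duration, r.Err)
	}
}
```

The same wrapper shape applies to the DB layer and to plugin calls, which is why the tracing ceremony can mostly stay out of the business logic.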
Thanks for putting this proposal together Antoine!
I think we should focus on the request-response APIs first (like the registration API), since it's harder for me to understand the value of tracing in the streaming API. (What would the span be?)
I've previously done OpenTracing hooked up to Stackdriver Tracing. It was a good combination once I got it to work :) but there were some issues with fields that OpenTracing didn't like to fill in that Stackdriver required, etc. Anyway I'd suggest focusing on Zipkin or Jaeger since I think support will be better in the open source world.
A couple logistical questions:
- Would the server fail to start up if tracing was configured, but it couldn't connect to the tracing API? What if trace uploads started to fail while the server was already running?
- Is there any possibility of leaking security-sensitive information through trace labels? I think there probably isn't, but it's something to keep in mind. There is potential for timing attacks if we have spans around signing operations (so we probably shouldn't do that).
- Can we do what we need entirely through preexisting gRPC interceptors, or do we need individual StartSpan calls inside our key functions?
- Should we skip OpenTracing and just go straight for the new OpenTelemetry APIs, since they will eventually replace OpenTracing?
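On the question of uploads failing while the server is running, one possible answer (an assumption for discussion, not something decided in this thread) is to treat tracing as best effort: spans go into a bounded queue, and when the queue is full because the backend is unreachable, new spans are dropped and counted rather than blocking the request path. The exporter type and its methods below are hypothetical and use only the standard library.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// exporter is a hypothetical best-effort span exporter. A background
// uploader (not shown) would drain queue and send spans to the backend.
type exporter struct {
	queue   chan string
	dropped int64 // number of spans discarded because the queue was full
}

func newExporter(size int) *exporter {
	return &exporter{queue: make(chan string, size)}
}

// export never blocks. If the queue is full, for example because the
// uploader is stalled on an unreachable backend, the span is dropped
// and the drop counter is incremented. It reports whether the span
// was queued.
func (e *exporter) export(span string) bool {
	select {
	case e.queue <- span:
		return true
	default:
		atomic.AddInt64(&e.dropped, 1)
		return false
	}
}

// demo simulates a stalled uploader: nothing drains the queue, so only
// as many spans as the queue can hold are accepted.
func demo() (queued int, dropped int64) {
	e := newExporter(2)
	for _, s := range []string{"span-1", "span-2", "span-3"} {
		if e.export(s) {
			queued++
		}
	}
	return queued, atomic.LoadInt64(&e.dropped)
}

func main() {
	queued, dropped := demo()
	fmt.Println("queued:", queued, "dropped:", dropped)
}
```

With this design the drop counter can be exposed as a metric, so a failing tracing backend is visible to operators without affecting SVID issuance. Whether startup should fail on a bad tracing configuration is a separate policy choice that this sketch does not address.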
Again, thanks for putting this together! I think this would be a valuable contribution to the project!
This issue is stale because it has been open for 365 days with no activity.
This issue was closed because it has been inactive for 30 days since being marked as stale.