opencensus-specs icon indicating copy to clipboard operation
opencensus-specs copied to clipboard

Proposal: define standard RPC tags

Open semistrict opened this issue 6 years ago • 9 comments

We should define a few standard tags that we believe will be generally useful and have proven to be so internally at Google.

These tags include:

  • originator: the first RPC that triggered the chain of requests (trace)
  • caller: the immediate previous RPC
  • method: the current RPC

Some of the open questions:

  • [ ] Should these be defined across gRPC, HTTP and maybe in future other RPC protocol or should we have separate tags per RPC protocol
  • [ ] Do we need a general mechanism to implement these propagation semantics or should we just hard-code these special tags?
  • [ ] Any other tags we should consider adding to this set?

semistrict avatar May 07 '18 23:05 semistrict

  • What is the value of originator and caller? This only works in systems that have a knowledge of "user" or "service" but does not work in all the systems, because we do not know how to set them. One option could be that the user has to specify them or set a callback on the rpc plugins for example.
  • We used to have method but this tag is not enough, maybe in a combination with protocol_type would work.

Also for reasons that are written here https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md#why-different-tag-name-for-serverclient-method we should make these tags incoming_method and incoming_protocol_type or server_method and server_protocol_type.

I would like to close this asap because I don't think is good to make too many changes that are so critical for the system.

Should these be defined across gRPC, HTTP and maybe in future other RPC protocol or should we have separate tags per RPC protocol

I think in this case we combine the 2 tags that I mentioned, this is a good solution but the downside is that users will have to deal with multiple tags in configs/views. The advantage is that they are very clear where they are added, the disadvantage is that we need special tags for each RPC system and maybe more complicated views when the application uses multiple transport systems.

Do we need a general mechanism to implement these propagation semantics or should we just hard-code these special tags?

If we do use one name hard-coding is good enough, otherwise we can handle this by using what we discuss to implement here https://github.com/census-instrumentation/opencensus-specs/issues/110

Any other tags we should consider adding to this set? I have some tags in my mind (maybe not all of them make sense):

  • protocol_type or transport_type that I mentioned before
  • status may be an interesting tag to handle (but it is different because it propagates backwards, but a standard tag for that is probably good).
  • caller_method the incoming method on the caller.

bogdandrutu avatar May 08 '18 17:05 bogdandrutu

Just to be clear, I'm not proposing that this replace any of the existing tags. I am not proposing to make any changes to the existing spec.

I think that this proposal is compatible with the existing tags because the existing tags aren't suitable to be propagated and this proposal deals mainly with tags that are propagated (besides "method", which I just added for completeness, we can discuss that separately). I think we need both.

semistrict avatar May 08 '18 19:05 semistrict

Don't the doesn't the parent_span_id of a trace equal the caller? If we recurse up parent_span_id, don't we get to the proposed originator? What use of method makes name not already appropriate?

I'm not sure what semantics are being implied here that make these additions useful. It feels to me like the existing span structure already has these forms of links & information built in. I'd like to hear some more of why these aren't already sufficient.

rektide avatar May 16 '18 23:05 rektide

@rektide OpenCensus handles both stats/metrics and tracing. These are really separate concerns for now. Tags (as contemplated in this issue) are only available in the stats part of OpenCensus and don't really have any connection to tracing.

Concretely for example, the parent_span_id is not a useful value for caller because you could not break down stats by this value. It would be meaningless and very high cardinality (so would not work in time series databases). Something like a service name or a user name would be a meaningful caller tag (inside Google it is a user name but not sure where we would get this from so we might rather go with a service name).

semistrict avatar May 17 '18 18:05 semistrict

Could these include a process id information perhaps? (e.g. host/pod name + pid)? These are useful when correlating with specific log events and investigating specific instance performance.

tcolgate avatar Jul 25 '18 09:07 tcolgate

Additionally, one useful element in the opentracing/jaeger instrumentation I have used before is a standard "error" boolean tag, this allows visualisation to easily show spans with known error conditions.

tcolgate avatar Jul 25 '18 09:07 tcolgate

sorry but error boolean is a terrible idea (error=false anyone)... I'd much rather see the absence or availability of an error key with a string payload like Zipkin which allows this visualisation too

basvanbeek avatar Jul 25 '18 10:07 basvanbeek

@basvanbeek I don't think the idea is terrible. Also, let's avoid that kind of discouraging language please.

Having a error key with a string payload may work fine for a tracing system like Zipkin, but if we are talking about RPC tags which go into time series databases, high cardinality string errors are not an option.

I think there is already a low cardinality status tag anyway and trace.Span has a dedicated Status field.

semistrict avatar Jul 25 '18 16:07 semistrict

@ramonza I've already had a bit of practice with some of the less positive tone (fortunately @kjin has been fantastic help over in the NodeJS repo, but for him I'd probably not have stuck around)

If the tags on span may end up propagating into metrics, a raw string tag would be painful. If the low cardinality status tag is already reasonably standard, that may be enough (assuming a status != OK style query is easy to support in tracers).

tcolgate avatar Jul 25 '18 18:07 tcolgate