opencensus-specs Add option "recordLatencyStats" in the span constructor

The behavior of this bit will be to record a measurement with the span latency when the Span is ended:

The latency measurement will be recorded with the existing tags in the context when end() is called plus extra two tags:

Tag name	Description
opencensus_span_name	The Span name, e.g. com.exampleapi.v4.BookshelfService/Checkout
opencensus_span_status	Span status code , e.g. OK, CANCELLED, DEADLINE_EXCEEDED

We will record the span latency in "ms" as double.

Measure name	Unit	Description
opencensus.io/trace/span_latency	ms	The latency of the Span in milliseconds.

We will expose 2 default views:

View name	Measure	Aggregation	Tags
opencensus.io/trace/span_latency	opencensus.io/tracer/server_latency	distribution	opencensus_span_name, opencensus_span_status
opencensus.io/trace/completed_spans	opencensus.io/tracer/server_latency	count	opencensus_span_name, opencensus_span_status

Observations:

One of the trivial question is if we need also a "opencensus.io/trace/started_spans" measurement and default view. For the moment there is no clear requirement for that but to leave our option to add this later probably we should have an enum instead of the bit that I proposed with only one value RECORD_SPAN_STATS{ END_SPAN_ONLY } (later we can add START_SPAN_ONLY and START_AND_END_SPAN.
This option is independent of sampled or recording events. That means that if the span is sampled or we record events locally does not mean we record stats for this span.
This logic initially can be implemented in the end() function of the Span implementation, but as an optimization it can be done async.

Alternative consideration:

Add start/end Span callbacks:
- The reason this is not ideal is because these callbacks will be called on the critical path and will not let us to record these stats async. We can call the stats API but that is less optimal than doing the entire work async.
- Time will be measured twice and can be different (alternatively we can always record the start/end times for the span, but this is not optimal).

In this proposal I don't propose to never add start/end Span callbacks which I think are useful for other reasons, but I try to propose something that can be implemented by the core library and it is very useful even if we support start/end Span callbacks.

Sep 28 '18 02:09 bogdandrutu

This proposal is only required as we do not have catch-all exporter, correct? If we would have catch-all exporter, this metric may be implemented as a such exporter, isn't it?

Sep 28 '18 03:09 SergeyKanzhelev

Specifically for the proposal - span.kind is another important tag for this metric. Typically we show incoming spans separately from outgoing

Sep 28 '18 03:09 SergeyKanzhelev

@SergeyKanzhelev for the span.Kind +1.

@SergeyKanzhelev catch all exporter can be useful here, but for performance reasons I think we need to only record start/end time just for this. So an alternative option would be to have a level of informations that we record: only_latency, all_trace_events. In order to achieve the performance we want we need an interface for catch all exporter that has hooks sync and async:

public abstract class CatchAllExporter {
  // If it returns true call StartAsync.
  // Maybe here we call before we actually start the span to determine the
  // level so the argument may be span_context, name, etc.
  public abstract boolean StartEventSync(Span); 

  // Run only if the StartEventSync returned true and at an arbitrary time.
  public abstract void StartEventAsync(Span /* or SpanData? */);

  // If it returns true save the span for async process.
  public abstract boolean EndEventSync(Span);

  // Run only if the EndEventSync returned true and at an arbitrary time.
  public abstract void EndEventAsync(Span /* or SpanData? */);
}

I think I entered into too many details. I see a need for having a level that says record_only_start_end_time. Do you think we should start the catch-all interface? Happy to do that if you think that is needed.

Sep 28 '18 03:09 bogdandrutu

catch all exporter can be useful here, but for performance reasons I think we need to only record start/end time just for this.

Can you please elaborate? Do you mean Span and SpanData will not be created, but metric for those will be incremented? We need catch all for zpages anyway. So why there will be a perf impact?

Yes, it would be nice to discuss catch app. I'm thinking soon I'll get to zpages implementation for C# and I'd prefer to have them implemented as pluggable catch-all module rather than built-in thing.

Sep 28 '18 04:09 SergeyKanzhelev

Proposed a more generic solution in https://github.com/census-instrumentation/opencensus-specs/issues/191. This proposal will be just an implementation of the proposed SpanHandler and record_stats bit that will become the TracingLevel.RECORD_LATENCY.

Sep 28 '18 17:09 bogdandrutu