opentelemetry-specification
Sync and Async children (FOLLOWS_FROM)
In OpenTracing, we have CHILD_OF and FOLLOWS_FROM. In the new project, we are considering whether to include this concept as a flag on the SpanBuilder option when setting the span parent. The proposed new naming is sync and async children, to make the relationship clearer.
Reference PR: https://github.com/bogdandrutu/openconsensus/pull/130
Questions:
- Do we still want this at all? It can be useful for critical path and other types of trace analysis.
- Do we also need an unknown flag?
@tedsuo How would you label a server span whose parent was propagated from an HTTP request?
Perhaps a better name would be direct/indirect?
(Anything pulled directly from the current context would be direct, otherwise indirect.)
Perhaps that wouldn't give you the info required for critical path analysis you need... part of the problem is only the caller really knows if they're blocking, where the link/relationship is established by the callee.
{I guess some of this was already discussed in #14}
I don't think sync and async are the correct semantics for this. My understanding of ChildOf and FollowsFrom is that ChildOf is the previous operation that directly created the new span, and FollowsFrom is any previous operation that indirectly caused the new span. This is unrelated to the references being asynchronous or not. Both potentially have value but they are conceptually different.
An example of this in JavaScript is promises. The ChildOf would generally be the span in the scope where promise.then() was called, and FollowsFrom would be, for example, the span where resolve() was called.
My take on the questions above:
Do we still want this at all? It can be useful for critical path and other types of trace analysis.
I think both ChildOf/FollowsFrom and sync/async potentially have value for different reasons.
Do we also need an unknown flag as well?
What is the case where this would not be known? I think this is something that is always known in advance.
@rochdev Hey, sorry for the late answer.
Agreed with what you wrote, but what names do you think we should use? @tylerbenson already mentioned direct/indirect as options, and if you have something in mind feel free to propose it.
I think ChildOf and FollowsFrom made a lot of sense. To be honest, semantics-wise I think OpenTracing got a lot of things right.
What was this relationship called in OpenCensus?
@rochdev I am worried that your understanding of childOf and followsFrom is different from that of others. I think what you explained is different from what others explained to me, and I am now very confused about the correct meaning of childOf vs followsFrom.
Moving to the API revision milestone on specification. We need to collect more feedback.
In Node this concept is very important and is core to how context propagation works in the runtime itself. For this purpose specifically, we call them execution and trigger. In Node you can only have one of each because it's limited to function calls, but from a tracing perspective it makes sense that you could follow from multiple different operations.
Let me give an example specific to Node at the language level:
const promise = new Promise((resolve, reject) => {
  resolve() // execution ID here is 2
})
// execution ID here is 1
promise.then(() => {
  // here, execution ID is 1 and trigger ID is 2
})
The reasoning for the above is that resolve() is what triggered the execution of the callback, but the callback was actually registered when then() was called, so that's its execution parent.
For a case like this, we could then say that the callback is running as a ChildOf the context where then() was called, and FollowsFrom the context where resolve() was called.
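To make that mapping concrete, here is a minimal sketch. It assumes a hypothetical in-memory span model (the names startSpan, childOf, followsFrom and describeReferences are made up for illustration and are not OpenTracing or OpenTelemetry APIs), and it only shows one span carrying both reference types at once.

```javascript
// Hypothetical minimal span model for illustration; not a real tracing API.
let nextSpanId = 1;

function startSpan(name, references = []) {
  return { id: nextSpanId++, name, references };
}

const childOf = (span) => ({ type: 'child_of', spanId: span.id });
const followsFrom = (span) => ({ type: 'follows_from', spanId: span.id });

// Span active in the scope where promise.then() registers the callback.
const registering = startSpan('register-callback');
// Span active in the scope where resolve() is called.
const resolving = startSpan('resolve-promise');

// The callback span records both relationships.
const callback = startSpan('then-callback', [
  childOf(registering),
  followsFrom(resolving),
]);

function describeReferences(span) {
  return span.references.map((ref) => `${ref.type}:${ref.spanId}`).join(', ');
}

console.log(describeReferences(callback)); // child_of:1, follows_from:2
```

Whether this is the intended meaning of the two reference types is exactly what is being debated here, so treat the sketch as one possible reading.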
It's possible I got this completely wrong and that ChildOf/FollowsFrom has nothing to do with the relationship described above. I think the best person to explain the real meaning of these is probably @tedsuo.
In general, I think the different wordings proposed in this thread make sense, but they don't necessarily map 1:1 with each other.
The problem here is slightly bigger than what to call these. Kudos @rochdev for thinking OpenTracing got it right, but it didn't, not quite (cf. this blog post). The most fundamental question in analyzing the graph of events is Lamport's happens-before relationship. In the OpenTracing span model the following holds:
parent.start happens-before child.start
That's it! Neither child-of nor follows-from implies any further causality. Child-of only means the parent depends on the outcome of the child, in some way. It doesn't mean the parent is blocked: it can be doing other things (thus sync/async naming isn't quite right). It doesn't mean the child completes before the parent: it looks this way, but the parent (RPC caller) may time out before the child (RPC server) finishes. In case of such a timeout, OpenTracing does not have a convention on how the parent should record that fact (sad face).
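The timeout case can be sketched as follows. The span records and the orderingFacts helper are hypothetical, and the timestamps are assumed to come from a single clock, which real distributed traces cannot rely on; the point is only that the start ordering holds while the end ordering does not.

```javascript
// Hypothetical span records with millisecond timestamps from one shared clock.
const parent = { name: 'rpc-client', start: 0, end: 500 }; // caller times out at 500 ms
const child = { name: 'rpc-server', start: 20, end: 800 }; // server keeps working

function orderingFacts(parentSpan, childSpan) {
  return {
    // The only ordering the span model guarantees.
    parentStartsBeforeChildStarts: parentSpan.start <= childSpan.start,
    // Often true in practice, but not guaranteed by child-of.
    childEndsBeforeParentEnds: childSpan.end <= parentSpan.end,
  };
}

const facts = orderingFacts(parent, child);
console.log(facts);
```

Here parentStartsBeforeChildStarts is true and childEndsBeforeParentEnds is false, even though the child would still be recorded as child-of the parent.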
The difference between child-of and follows-from is useful, in practice, for calculating critical path, but strictly speaking that calculation is not possible since the causality is not captured between the ends of spans, so critical path can only be calculated via a heuristic (I would love to be disproven on this!).
Another odd thing about child-of and follows-from is that it's the child span that defines this reference type, even though it talks about the parent's dependency on the child's outcome. If you're a remote server, how do you even know whether the parent/caller does or does not depend on your outcome? I tend to think of this as the nature of the protocol: the producer of a message to Kafka does not expect any response, so the receiver should use follows-from. The sender of an HTTP request does expect a response, so the server always uses child-of, even if the sender doesn't care about the outcome; in that case the sender can internally create a follows-from span first, and then a normal pair of RPC call spans. So it's possible to rationalize it this way, but it's still kind of dirty.
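A rough sketch of that "nature of the protocol" rule, seen from the receiving side. The lookup table and the referenceTypeFor function are assumptions made for illustration; only the HTTP and Kafka entries come from the reasoning above.

```javascript
// Hypothetical table: does the sender of this protocol expect a response?
const SENDER_EXPECTS_RESPONSE = new Map([
  ['http', true], // request/response: the caller waits for a reply
  ['kafka', false], // fire-and-forget: the producer does not expect a reply
]);

// Pick the reference type the receiver would record for its incoming parent.
function referenceTypeFor(protocol) {
  if (!SENDER_EXPECTS_RESPONSE.has(protocol)) {
    throw new Error(`unknown protocol: ${protocol}`);
  }
  return SENDER_EXPECTS_RESPONSE.get(protocol) ? 'child_of' : 'follows_from';
}

console.log(referenceTypeFor('http')); // child_of
console.log(referenceTypeFor('kafka')); // follows_from
```

The receiver never inspects what the caller actually does with the result, which is the "kind of dirty" part of this rationalization.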
Of course, there's always the argument that OpenTelemetry 1.0 is not supposed to improve upon OpenTracing/OpenCensus (convergence is more important than improvements), in which case it doesn't matter much what we call these, because the model would need to be revisited anyway.
Thanks for the clarification @yurishkuro! It sounds like this is a larger discussion then. Would it make more sense to wait then instead of implementing this knowing that it's not necessarily the correct way to handle this relationship?
Of course, this depends on whether users are currently depending on the feature. If that's the case, then I think we should get more information about exactly how it's used which would give us a better understanding of how the currently used feature should be called.
I think a flag for sync/async (blocking/non-blocking) spans would be very useful for trace analysis. With it you could identify hot spots along the critical path, and thus the "root causes" of long trace timings, much more easily and quickly. Such a flag is important because it indicates whether a child span's time is included in the parent's time, and therefore whether a parent span is slow because of the child span or independently of it. Without such a flag, this question cannot be answered.
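As an illustration of how such a flag could feed an analysis, here is a minimal sketch that computes a parent's own time by subtracting only its blocking children. The blocking field and the selfTime helper are hypothetical, and the sketch assumes blocking children lie within the parent and do not overlap each other.

```javascript
// Hypothetical span records; "blocking" stands for the proposed sync/async flag.
const parentSpan = { name: 'handle-request', start: 0, end: 100 };
const childSpans = [
  { name: 'db-query', start: 10, end: 60, blocking: true }, // parent waits for this
  { name: 'audit-log', start: 20, end: 90, blocking: false }, // fire-and-forget
];

// Time the parent spent on its own work, excluding time spent waiting on
// blocking children. Assumes blocking children do not overlap each other.
function selfTime(parent, children) {
  const waited = children
    .filter((child) => child.blocking)
    .reduce((sum, child) => sum + (child.end - child.start), 0);
  return parent.end - parent.start - waited;
}

console.log(selfTime(parentSpan, childSpans)); // 50
```

Without the flag, an analyzer could not tell whether to subtract the 70 ms of audit-log as well.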
I agree that in many cases only the caller knows whether a child span is blocking or not, but in such cases this information could be propagated with the context, so that it can be recorded on the child span's side.
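A sketch of what propagating that information could look like. The header names, the callerBlocks field, and the inject/extract/startServerSpan helpers are all hypothetical and do not correspond to any existing propagation format.

```javascript
// Hypothetical header names; not part of any real propagation format.
const TRACE_ID_HEADER = 'x-example-trace-id';
const PARENT_ID_HEADER = 'x-example-parent-span-id';
const BLOCKING_HEADER = 'x-example-caller-blocking';

// Caller side: write the context, including whether the caller will block.
function inject(carrier, context) {
  carrier[TRACE_ID_HEADER] = context.traceId;
  carrier[PARENT_ID_HEADER] = context.spanId;
  carrier[BLOCKING_HEADER] = context.callerBlocks ? '1' : '0';
  return carrier;
}

// Callee side: read the context back from the carrier.
function extract(carrier) {
  return {
    traceId: carrier[TRACE_ID_HEADER],
    parentSpanId: carrier[PARENT_ID_HEADER],
    callerBlocks: carrier[BLOCKING_HEADER] === '1',
  };
}

// The child span records the relationship the caller declared.
function startServerSpan(name, carrier) {
  const parent = extract(carrier);
  return {
    name,
    traceId: parent.traceId,
    parentSpanId: parent.parentSpanId,
    relationship: parent.callerBlocks ? 'sync' : 'async',
  };
}

const headers = inject({}, { traceId: 't1', spanId: 's1', callerBlocks: false });
const serverSpan = startServerSpan('consume-message', headers);
console.log(serverSpan.relationship); // async
```

This only works when the caller knows at send time whether it will wait, which is not true for cases like a wait with a timeout.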
To fully capture all (or at least more) of the possible relationships between Spans, in addition to the create(parent) API, we would need APIs that signal that a span begins/ends waiting for a particular (set of) child(ren) and possibly even that it consumes the result of a particular child. Of course we would need to identify a child on the parent side without the child communicating back its span ID, which is a whole other problem (but can probably be solved elegantly by introducing IDs not only for the nodes but also for the edges in the span graph). Heck, you can even wait synchronously (occupying a thread) or asynchronously (by using something like async/await where other operations can be scheduled while a different one waits for I/O).
To show some difficult cases (pseudo C#):
// Start async (the child has no idea whether we used a blocking API or an async
// one -- nor should it have to know).
var request = myClient.GetAsync("http://example.com/myAPI");
var myCalcResult = /* some expensive calculation */; // Could be its own subspan
// Block for child request, but not indefinitely
var maybeResult = await request.withTimeout(500);
if (maybeResult.HasValue) {
    renderCompletion(myCalcResult, maybeResult.Value); // Consume result
} else {
    renderPleaseWait();
    // Another operation/span will consume the result (the handle-response part of the
    // client could become a child of the server Span, or it could be an independent
    // root span with a CONSUMES relation to the server span).
    delayedRequests.AddPending(new PendingInfo(myCalcResult, request));
}
Moving to v0.3
Closing this as being accomplished through Links in the current spec.
Links don't address the issue in this ticket. I suggest closing #86 instead, because there is more discussion here.
Links do not solve the problems discussed here (e.g. only the parent having the info whether it waits for the result, hence whether this is a sync/async call).
Please explain why the SpanKind field does not satisfy this?
For example, it does not allow establishing a child-of/follows-from-like relationship between internal spans (because SpanKind is currently overloaded).
Also, SpanKind is not an attribute of Links.
Any update on this?
I think this integration will reduce frustration for people coming from OpenTracing and it will avoid unnecessary OpenTelemetry customization by the user.
For asynchronous processing where Kafka, RabbitMQ, etc. are involved, it would be nice to have a FOLLOWS_FROM span.
I'm noticing some Merged changes related to this issue, but I'm not seeing the actual work merged into any release yet. What's the ETA on getting FOLLOWS FROM functionality built into OpenTelemetry?
I have a NATS producer and want to continue to propagate the trace information in the NATS consumer, just like @OlivierAlbertini mentioned above.
Any update on this? This would be useful for Messaging scenarios as mentioned above. So when the Producer writes messages to the queue the linked messages would be CHILD_OF the producer Span. And on the Consumer when it reads the messages from the Queue the links with the messages would be FOLLOWS_FROM.
I have the feeling there are not enough people who want this strongly enough. If you want to get this rolling, you should probably make a spec PR or OTEP along with a prototype implementation PR, e.g. to opentelemetry-java or whatever your preferred language is (best to do this end-to-end from the API to the Jaeger exporter which, I assume, supports this span relationship). EDIT: Or try to bring it up in tomorrow's SIG spec Zoom meeting to see what others think.
Duplicate of #562
There is a lot of discussion in this ticket, while #562 is very short and looks more like an implementation detail than a spec question. I don't think they are duplicates, and even if they are I would keep this one for context.
Reopening so we don't lose it.
After giving this a try via #906, we decided to postpone it in order to develop it properly. Re-labeling it so we can add this feature after GA.
any update on this?
+1
Solved via span links.