sdk-java icon indicating copy to clipboard operation
sdk-java copied to clipboard

OpenTracing duplicates user created spans during replay

Open Spikhalskiy opened this issue 4 years ago • 10 comments

Expected Behavior

If we have the following code somewhere in workflow:

void execute() {
  ...
  withSpan(() -> {
    activity.runActivity();
  })
  ...
  //end of replay somewhere down from the above code block
}

If we replay the code above, we should ignore the span creation, because this span was already reported during the original execution.

Actual Behavior

The second span will be created during replay.

Proposed solution

We should create a thin layer on top of Tracer, ScopeManager, Scope or Span that will check if we are still replaying on span finalization and if we are - discard the span instead of finalizing it.

Notes

The issue is raised here: #951 https://github.com/temporalio/sdk-java/pull/951#pullrequestreview-844189585

And the solution is originally proposed here: https://github.com/temporalio/sdk-java/pull/951#issuecomment-1005896960

Spikhalskiy avatar Jan 05 '22 17:01 Spikhalskiy

But what is true, the user-created span may be ignored during replay (depends on its location) with my original proposal. And in your sample, the activity may end up directly under the WorkflowRun span. But this is a smaller issue, let me think about it. I really believe that we can resolve it just as inside a thin Tracer wrapper that can make some smart "buffering" of spans during replays inside and track when isReplaying switches from true to false. Or even simpler, the tracer can always open spans when requested but check isReplaying on .finish() call and discard the span if isReplaying is still true. At first sight, it looks like it will work.

So the problem is that right now with a user created span the activity span will be parentless because the parent span (the user created one) is never emitted. The only way to fix this is to recreate the same span with the exact same spanId on replay.

bergundy avatar Jan 05 '22 17:01 bergundy

@bergundy

So the problem is that right now with a user created span the activity span will be parentless because the parent span (the user created one) is never emitted.

it's not. For this specific reason, JavaSDK always creates WorkflowRun span. As you can see in the test that I pointed and it's javadoc: https://github.com/temporalio/sdk-java/blob/a2bc9373201dd58a607197ac39b7f87c1c0059c2/temporal-opentracing/src/test/java/io/temporal/opentracing/ActivityFailureTest.java#L131 The activity span will not be parentless, it will have WorkflowRun span as a parent.

But yes, it will not have a user-created span as a parent. I'm fairly confident that the proposed solution will address it well.

Spikhalskiy avatar Jan 05 '22 17:01 Spikhalskiy

@bergundy

The only way to fix this is to recreate the same span with the exact same spanId on replay.

It's not. Even more, OpenTracing and OpenTelemetry specifications don't allow this (creation of a span with a specific spanId). A span that was finished can't be just reopened for appending, it's not a part of spec. Check OT SpanBuilder interface.

Each run/replay has its own WorkflowRun span (it's the case right now) The user span should be duplicated under each WorkflowRun span and contain only activities that were actually executed and not replayed in that run. If span was fully executed in the previous run - it shouldn't be recreated in the next run. The decision if the span was fully executed in the previous run should be made on the span .finish call - this way we are sure if we are still replaying when the span is finishing. This solution will be congruent with the current JavaSDK approach and OpenTracing/OpenTelemetry spec. This solution will also be good for debugging, because of the level of detail that it provides.

Spikhalskiy avatar Jan 05 '22 17:01 Spikhalskiy

A span that was finished can't be just reopened for appending

In the case I've illustrated in #951 the span is unfinished (the workflow is evicted from cache before the activity can complete)

bergundy avatar Jan 05 '22 17:01 bergundy

In the case I've illustrated in #951 the span is unfinished (the workflow is evicted from cache before the activity can complete)

In Java the conventional code working with spans looks like:

 Span span = tracer.buildSpan("...").start();
 try (Scope scope = tracer.scopeManager().activate(span)) {
     ...
     activity1.execute()
     activity2.execute()
     ...
 } catch (Exception e) {
     span.log(...);
 } finally {
     span.finish();
 }

Opened scope and span will be finished and reported in case of eviction or any other Exception/Error halfway. As it should. You don't really want to lose all the info collected in the spans (with potentially logs, tags, and other details that can't be restored during replay) including from the underlying activities just because the workflow got evicted. And it will be lost if the parent span is not finished.

Spikhalskiy avatar Jan 05 '22 17:01 Spikhalskiy

Okay, I see, it works differently than in TS where we completely discard the workflow from memory when it's evicted without running any cleanup.

So in this case if the workflow is replayed before it can execute activity2, activity2's parent span will be the root workflow span which isn't that bad.

bergundy avatar Jan 05 '22 17:01 bergundy

@bergundy Yeah, in Java you can't just nuke something without running the corresponding .finally() blocks. Which has its own advantages (here) and creates its own challenges (potential non-determinism if somebody puts an activity execution in .finally()) for us.

Spikhalskiy avatar Jan 05 '22 17:01 Spikhalskiy

There's also another issue with this approach where if we evict the workflow the user's span will end erroneously while when looking at it from the code perspective everything seems fine.

bergundy avatar Jan 05 '22 18:01 bergundy

I don't think that approach that doesn't create a span on replay would work. The problem is that a newly executed code can create a child span and needs a parent span. But parent span wouldn't be present on replay. The ideal solution would allow assigning span-id deterministically on replay. But unfortunately none of the tracing libraries supports assigning span ids.

The other solution is to serialize the span into the history and then recreate it from the serialized data on replay.

mfateev avatar Jan 23 '22 21:01 mfateev

But parent span wouldn't be present on replay.

Don't forget that each workflow run (and replay to this matter) has its own workflow run parent span. So, each replay-followed-by-run has its own Span. This gives us a lot of power and flexibility here.

I will reiterate the solution I am convinced will work here, let's discuss this specific offer, not the objective. Because I don't see any reason why it shouldn't work and it's aligned with the current general spans structure.

Each run/replay has its own WorkflowRun span (it's the case right now) The user span should be duplicated under each WorkflowRun span and contain only activities that were actually executed and not replayed in that run. If span was fully executed in the previous run - it shouldn't be recreated in the next run. The decision if the span was fully executed in the previous run should be made on the span .finish call - this way we are sure if we are still replaying when the span is finishing. This solution will be congruent with the current JavaSDK approach and OpenTracing/OpenTelemetry spec. This solution will also be good for debugging, because of the level of detail that it provides.

Spikhalskiy avatar Jan 23 '22 21:01 Spikhalskiy