brave
brave copied to clipboard
Parent can only be null in a local root
Describe the bug
related issue: https://github.com/spring-cloud/spring-cloud-sleuth/issues/1814
I am using Spring Cloud Sleuth 3.0.1 and Brave 5.3.12 The bug is really sporadic but does cause the worfkflow engine (flowable, which I have instrumented manually) to not call the next tasks in the BPMN workflow.
The error is this:
Exception in thread "orders-flowable-async-job-executor-thread-4" java.lang.AssertionError: Bug (or unexpected call to internal code): parent can only be null in a local root!
at brave.internal.recorder.PendingSpans.getOrCreate(PendingSpans.java:89)
at brave.Tracer._toSpan(Tracer.java:410)
at brave.Tracer.toSpan(Tracer.java:382)
at brave.Tracer.toSpan(Tracer.java:376)
at brave.LazySpan.span(LazySpan.java:141)
at brave.LazySpan.context(LazySpan.java:40)
at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)
at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:61)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I think it is on this line in TraceRunable:
I am of the opinion that the TraceRunanble should not prevent the delegate Runnable in any way. We might benefit from a check before calling the nextSpan
method or catching this exception somehow.
Sample
I can't reliably reproduce it, it is sporadic. We have on the order of 100 tests using flowable but I get one error every now and then.
Flowable can be set up with a custom implementation of a JobExectuor. I have it set up like so:
private AsyncExecutor initAsyncExecutor(String tennantId, final int maxConcurrent) {
final DefaultAsyncJobExecutor asyncExecutor = new DefaultAsyncJobExecutor();
asyncExecutor.setAutoActivate(false);
// other setup, irrelevant to the case
// customized default async job executor initialization with tracing info
// org.flowable.job.service.impl.asyncexecutor.DefaultAsyncJobExecutor.initAsyncJobExecutionThreadPool
asyncExecutor.setThreadPoolQueue(new ArrayBlockingQueue<>(queueSize));
asyncExecutor.setExecutorService(
new TraceableExecutorService(
this.applicationContext,
new ThreadPoolExecutor(
asyncExecutor.getCorePoolSize(),
asyncExecutor.getMaxPoolSize(),
asyncExecutor.getKeepAliveTime(), TimeUnit.MILLISECONDS,
asyncExecutor.getThreadPoolQueue(),
new BasicThreadFactory.Builder()
.namingPattern(tennantId + "-flowable-async-job-executor-thread-%d")
.build()
)
)
);
return asyncExecutor;
}
From the related bug report ...
And I gave it another shot and caught it. I did not modify the TraceRunnable.run
method in any way (like before) and I just logged the stack trace in the constructor.
Please double check my reasoning here.
So the regular (passing) case is when the context does not have the parent trace id - it is the local root:
With the AssertError caught the context is different - it is not the local root:
Then I went up the call stack to the LazySpan.span()
method (focus on the debuger variables):
So I called manually the tracer.toSpan(context)
line - it returns a RealSpan without the AssertError:
I tested the tracer.toSpan(context).context()
call as well to see - also OK:
Finally I tested the this.span()
call just for the fun of it - No AssertError the second time around:
So some race condition is happening but might not be the same as what we are expecting (unprotected this.delegate =
).
I went to see the captured stack trace of the TraceRunnable constructor and I see that here we have a case where the TraceExecutorService runs one TraceRunnable (the root) and that one later on execues another TraceRunnable and here is where the error happens:
My TraceRunnable: https://gist.github.com/FranPregernik/0d0e0ddff5f0f642f8f12313f42667e1
Any chance you can check it out @anuraaga?
Hi @jcchavezs, @anuraaga!
Any update on this?
Hi @jcchavezs, @anuraaga, @FranPregernik
I have same issue. Do you know how to fix it or can advise some workaround?
I'm impacted by the same issue. After digging in, it appears to happen due to a race condition within brave.Tracer
wherein the span is available in the pendingSpans collection when the first line of toSpan
completes, but is no longer present when _toSpan
checks the collection.
I've added a test case illustrating the issue here: https://github.com/andylintner/brave/commit/86a90861641c29c51bd7a8bf8de734323cdb0ade