brave Parent can only be null in a local root

Describe the bug

related issue: https://github.com/spring-cloud/spring-cloud-sleuth/issues/1814

I am using Spring Cloud Sleuth 3.0.1 and Brave 5.3.12 The bug is really sporadic but does cause the worfkflow engine (flowable, which I have instrumented manually) to not call the next tasks in the BPMN workflow.

The error is this:

Exception in thread "orders-flowable-async-job-executor-thread-4" java.lang.AssertionError: Bug (or unexpected call to internal code): parent can only be null in a local root!
	at brave.internal.recorder.PendingSpans.getOrCreate(PendingSpans.java:89)
	at brave.Tracer._toSpan(Tracer.java:410)
	at brave.Tracer.toSpan(Tracer.java:382)
	at brave.Tracer.toSpan(Tracer.java:376)
	at brave.LazySpan.span(LazySpan.java:141)
	at brave.LazySpan.context(LazySpan.java:40)
	at org.springframework.cloud.sleuth.brave.bridge.BraveSpan.context(BraveSpan.java:48)
	at org.springframework.cloud.sleuth.brave.bridge.BraveTracer.nextSpan(BraveTracer.java:52)
	at org.springframework.cloud.sleuth.instrument.async.TraceRunnable.run(TraceRunnable.java:61)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

I think it is on this line in TraceRunable:

I am of the opinion that the TraceRunanble should not prevent the delegate Runnable in any way. We might benefit from a check before calling the nextSpan method or catching this exception somehow.

Sample

I can't reliably reproduce it, it is sporadic. We have on the order of 100 tests using flowable but I get one error every now and then.

Flowable can be set up with a custom implementation of a JobExectuor. I have it set up like so:

 private AsyncExecutor initAsyncExecutor(String tennantId, final int maxConcurrent) {
        final DefaultAsyncJobExecutor asyncExecutor = new DefaultAsyncJobExecutor();
        asyncExecutor.setAutoActivate(false);

        // other setup, irrelevant to the case

        // customized default async job executor initialization with tracing info
        // org.flowable.job.service.impl.asyncexecutor.DefaultAsyncJobExecutor.initAsyncJobExecutionThreadPool
        asyncExecutor.setThreadPoolQueue(new ArrayBlockingQueue<>(queueSize));
        asyncExecutor.setExecutorService(
            new TraceableExecutorService(
                this.applicationContext,
                new ThreadPoolExecutor(
                    asyncExecutor.getCorePoolSize(),
                    asyncExecutor.getMaxPoolSize(),
                    asyncExecutor.getKeepAliveTime(), TimeUnit.MILLISECONDS,
                    asyncExecutor.getThreadPoolQueue(),
                    new BasicThreadFactory.Builder()
                        .namingPattern(tennantId + "-flowable-async-job-executor-thread-%d")
                        .build()
                )
            )
        );

        return asyncExecutor;
    }

Mar 24 '21 09:03 FranPregernik

From the related bug report ...

And I gave it another shot and caught it. I did not modify the TraceRunnable.run method in any way (like before) and I just logged the stack trace in the constructor.

Please double check my reasoning here.

So the regular (passing) case is when the context does not have the parent trace id - it is the local root: 01_regular_passing

With the AssertError caught the context is different - it is not the local root: 02_assert_will_be_thrown

Then I went up the call stack to the LazySpan.span() method (focus on the debuger variables): 03_walking_up_the_stack_to_span_method

So I called manually the tracer.toSpan(context) line - it returns a RealSpan without the AssertError: 04_calling_tracer_toSpan_in_debug

I tested the tracer.toSpan(context).context() call as well to see - also OK: 05_calling_tracer_toSpan_context_in_debug

Finally I tested the this.span() call just for the fun of it - No AssertError the second time around: 06_calling_this_span_no_error

So some race condition is happening but might not be the same as what we are expecting (unprotected this.delegate =). I went to see the captured stack trace of the TraceRunnable constructor and I see that here we have a case where the TraceExecutorService runs one TraceRunnable (the root) and that one later on execues another TraceRunnable and here is where the error happens: 07_interesting_stack_trace

My TraceRunnable: https://gist.github.com/FranPregernik/0d0e0ddff5f0f642f8f12313f42667e1

Mar 24 '21 09:03 FranPregernik

Any chance you can check it out @anuraaga?

Mar 24 '21 12:03 jcchavezs

Hi @jcchavezs, @anuraaga!

Any update on this?

May 11 '21 06:05 FranPregernik

Hi @jcchavezs, @anuraaga, @FranPregernik

I have same issue. Do you know how to fix it or can advise some workaround?

May 26 '21 09:05 r-lysak

I'm impacted by the same issue. After digging in, it appears to happen due to a race condition within brave.Tracer wherein the span is available in the pendingSpans collection when the first line of toSpan completes, but is no longer present when _toSpan checks the collection.

I've added a test case illustrating the issue here: https://github.com/andylintner/brave/commit/86a90861641c29c51bd7a8bf8de734323cdb0ade

Sep 07 '21 20:09 andylintner

brave brave copied to clipboard

Parent can only be null in a local root

brave
brave copied to clipboard