
Kotlin awaiting not traced correctly.


If you await the Deferred returned by a coroutine while inside another coroutine, your span information is lost and the whole span is recreated as if it were a parent (root) span. See the example project:

https://github.com/ikovalyov/dd-tracing-coroutines-issue-v2
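A minimal sketch of the pattern (condensed and assumed from the linked project, not copied from it; the method names are illustrative):

import datadog.trace.api.Trace
import kotlinx.coroutines.async
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

object Main {

    // Suspend function annotated for tracing; each call should show up as a child span.
    @Trace
    suspend fun waitFor(iteration: Int) {
        delay(1000)
        println("Iteration $iteration done")
    }

    // Parent span; awaiting the deferreds below is where the child spans lose this parent.
    @Trace
    @JvmStatic
    fun main(args: Array<String>) = runBlocking {
        val deferreds = (0 until 10).map { i -> async { waitFor(i) } }
        deferreds.forEach { it.await() }
    }
}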

Application logs:

> Task :run
[dd.trace 2019-12-02 09:23:39:910 +0100] [main] INFO datadog.trace.agent.jmxfetch.JMXFetch - JMXFetch config: null [] [] [] null null {runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, service=unnamed-java-app} statsd:localhost:8125 System.err INFO
[dd.trace 2019-12-02 09:23:40:502 +0100] [main] INFO datadog.trace.agent.ot.DDTraceOTInfo - dd-trace - version: 0.38.0~1cb704e7
[dd.trace 2019-12-02 09:23:40:571 +0100] [main] INFO datadog.trace.agent.ot.DDTracer - New instance: DDTracer-77bb48d5{ serviceName=unnamed-java-app, writer=DDAgentWriter { api=DDApi { tracesUrl=http://localhost:8126/v0.3/traces } }, sampler=datadog.trace.agent.common.sampling.RateByServiceSampler@181d8899, defaultSpanTags={}}
[dd.trace 2019-12-02 09:23:40:600 +0100] [main] INFO datadog.trace.agent.tooling.VersionLogger - dd-trace-ot - version: 0.38.0~1cb704e7
[dd.trace 2019-12-02 09:23:40:600 +0100] [main] INFO datadog.trace.agent.tooling.VersionLogger - dd-trace-api - version: 0.38.0~1cb704e7
[dd.trace 2019-12-02 09:23:40:601 +0100] [main] INFO datadog.trace.agent.tooling.VersionLogger - dd-java-agent - version: 0.38.0~1cb704e7
2019-12-02 09:23:41.840  INFO 3998 [           main] Main$main$1                              : main DDSpan [ t_id=2654422379082193946, s_id=5753717729389507729, p_id=0] trace=unnamed-java-app/trace.annotation/Main.main metrics={} tags={component=trace, language=jvm, runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, thread.id=1, thread.name=main}, duration_ns=0
2019-12-02 09:23:41.892  INFO 3998 [atcher-worker-2] Main                                     : DefaultDispatcher-worker-2 DDSpan [ t_id=2654422379082193946, s_id=1179641962218689830, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=0, component=trace, thread.id=26, thread.name=DefaultDispatcher-worker-2}, duration_ns=0
2019-12-02 09:23:41.893  INFO 3998 [atcher-worker-2] Main                                     : DefaultDispatcher-worker-2 DDSpan [ t_id=2654422379082193946, s_id=1179641962218689830, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=0, component=trace, thread.id=26, thread.name=DefaultDispatcher-worker-2}, duration_ns=0
2019-12-02 09:23:41.899  INFO 3998 [atcher-worker-3] Main                                     : DefaultDispatcher-worker-3 DDSpan [ t_id=2654422379082193946, s_id=2170501652924201656, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=1, component=trace, thread.id=27, thread.name=DefaultDispatcher-worker-3}, duration_ns=0
2019-12-02 09:23:41.900  INFO 3998 [atcher-worker-4] Main                                     : DefaultDispatcher-worker-4 DDSpan [ t_id=2654422379082193946, s_id=6175891178351208339, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=3, component=trace, thread.id=28, thread.name=DefaultDispatcher-worker-4}, duration_ns=0
2019-12-02 09:23:41.899  INFO 3998 [atcher-worker-1] Main                                     : DefaultDispatcher-worker-1 DDSpan [ t_id=2654422379082193946, s_id=8485714618033939065, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=2, component=trace, thread.id=25, thread.name=DefaultDispatcher-worker-1}, duration_ns=0
2019-12-02 09:23:41.900  INFO 3998 [atcher-worker-3] Main                                     : DefaultDispatcher-worker-3 DDSpan [ t_id=2654422379082193946, s_id=2170501652924201656, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=1, component=trace, thread.id=27, thread.name=DefaultDispatcher-worker-3}, duration_ns=0
2019-12-02 09:23:41.902  INFO 3998 [atcher-worker-5] Main                                     : DefaultDispatcher-worker-5 DDSpan [ t_id=2654422379082193946, s_id=3463457279142380734, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=4, component=trace, thread.id=29, thread.name=DefaultDispatcher-worker-5}, duration_ns=0
2019-12-02 09:23:41.903  INFO 3998 [atcher-worker-6] Main                                     : DefaultDispatcher-worker-6 DDSpan [ t_id=2654422379082193946, s_id=4348256562928250323, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=6, component=trace, thread.id=30, thread.name=DefaultDispatcher-worker-6}, duration_ns=0
2019-12-02 09:23:41.904  INFO 3998 [atcher-worker-7] Main                                     : DefaultDispatcher-worker-7 DDSpan [ t_id=2654422379082193946, s_id=4421324058712740639, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=5, component=trace, thread.id=31, thread.name=DefaultDispatcher-worker-7}, duration_ns=0
2019-12-02 09:23:41.905  INFO 3998 [atcher-worker-8] Main                                     : DefaultDispatcher-worker-8 DDSpan [ t_id=2654422379082193946, s_id=8537738308186511090, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=7, component=trace, thread.id=32, thread.name=DefaultDispatcher-worker-8}, duration_ns=0
2019-12-02 09:23:41.906  INFO 3998 [tcher-worker-10] Main                                     : DefaultDispatcher-worker-10 DDSpan [ t_id=2654422379082193946, s_id=1985917427252727825, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=9, component=trace, thread.id=34, thread.name=DefaultDispatcher-worker-10}, duration_ns=0
2019-12-02 09:23:41.909  INFO 3998 [tcher-worker-13] Main                                     : DefaultDispatcher-worker-13 DDSpan [ t_id=2654422379082193946, s_id=7556649444744964666, p_id=5753717729389507729] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={Iteration=8, component=trace, thread.id=37, thread.name=DefaultDispatcher-worker-13}, duration_ns=0
2019-12-02 09:23:42.942  INFO 3998 [atcher-worker-3] Main                                     : DefaultDispatcher-worker-3 DDSpan [ t_id=8401234232036504370, s_id=481145876484775946, p_id=0] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={component=trace, language=jvm, runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, thread.id=27, thread.name=DefaultDispatcher-worker-3}, duration_ns=0
[dd.trace 2019-12-02 09:23:43:587 +0100] [dd-trace-writer] WARN datadog.trace.agent.common.writer.DDApi - Error while sending 3 of 6 traces to the DD agent. java.net.ConnectException: Failed to connect to localhost/0:0:0:0:0:0:0:1:8126 (going silent for 5 minutes)
2019-12-02 09:23:43.946  INFO 3998 [atcher-worker-1] Main                                     : DefaultDispatcher-worker-1 DDSpan [ t_id=1567002908161829012, s_id=857742505563499203, p_id=0] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={component=trace, language=jvm, runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, thread.id=25, thread.name=DefaultDispatcher-worker-1}, duration_ns=0
2019-12-02 09:23:44.951  INFO 3998 [atcher-worker-7] Main                                     : DefaultDispatcher-worker-7 DDSpan [ t_id=2602817791208366767, s_id=926635854296932000, p_id=0] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={component=trace, language=jvm, runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, thread.id=31, thread.name=DefaultDispatcher-worker-7}, duration_ns=0
2019-12-02 09:23:45.953  INFO 3998 [atcher-worker-7] Main                                     : DefaultDispatcher-worker-7 DDSpan [ t_id=7206127938673410068, s_id=6063878620474869192, p_id=0] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={component=trace, language=jvm, runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, thread.id=31, thread.name=DefaultDispatcher-worker-7}, duration_ns=0
2019-12-02 09:23:46.957  INFO 3998 [atcher-worker-7] Main                                     : DefaultDispatcher-worker-7 DDSpan [ t_id=4634851895448235703, s_id=3472674439697055107, p_id=0] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={component=trace, language=jvm, runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, thread.id=31, thread.name=DefaultDispatcher-worker-7}, duration_ns=0
2019-12-02 09:23:47.963  INFO 3998 [atcher-worker-7] Main                                     : DefaultDispatcher-worker-7 DDSpan [ t_id=3557106430153605668, s_id=8402253980305412920, p_id=0] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={component=trace, language=jvm, runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, thread.id=31, thread.name=DefaultDispatcher-worker-7}, duration_ns=0
2019-12-02 09:23:48.967  INFO 3998 [atcher-worker-7] Main                                     : DefaultDispatcher-worker-7 DDSpan [ t_id=6980458062161759504, s_id=665376991271971413, p_id=0] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={component=trace, language=jvm, runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, thread.id=31, thread.name=DefaultDispatcher-worker-7}, duration_ns=0
2019-12-02 09:23:49.972  INFO 3998 [atcher-worker-7] Main                                     : DefaultDispatcher-worker-7 DDSpan [ t_id=4854400913329068137, s_id=7033435703755271922, p_id=0] trace=unnamed-java-app/trace.annotation/Main.waitFor metrics={} tags={component=trace, language=jvm, runtime-id=fedd6bdd-2499-451c-bb36-0f662f3e8648, thread.id=31, thread.name=DefaultDispatcher-worker-7}, duration_ns=0

Do you have any workaround for this?

Thanks.

ikovalyov avatar Dec 02 '19 08:12 ikovalyov

Thanks a ton for the example. I think this has to do with annotating suspend methods with @Trace. Suspend methods behave sequentially in Kotlin, but under the hood they are invoked asynchronously, almost like passing in a callback: a secondary parameter is added to the method signature and a continuation is passed in so that the coroutine can proceed with the execution of its subsequent statements. This particular use case isn't being exercised in the existing tests at the moment, but I do think it's a big gap in Kotlin support.
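For illustration, roughly what that compiler transformation looks like (a sketch; fetchData is a made-up example, not the agent's instrumentation target):

import datadog.trace.api.Trace
import kotlinx.coroutines.delay

// As written in Kotlin:
@Trace
suspend fun fetchData(id: Long): String {
    delay(10)
    return "data-$id"
}

// Roughly the JVM signature the compiler produces for it (illustration only):
//
//   fun fetchData(id: Long, completion: Continuation<String>): Any?
//
// The call may return early with COROUTINE_SUSPENDED and complete later through
// `completion`, possibly on a different thread, so instrumentation that only wraps
// the annotated method body can lose the parent/child relationship between spans.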

I'll need to dig into the logic surrounding method-level instrumentation, but this example project is really going to help.

marcoferrer avatar Dec 04 '19 17:12 marcoferrer

Hi @marcoferrer, I'm not really familiar with the internal DD implementation of the tracing API, but I have a few ideas about what could cause this issue.

The OpenTracing API says there can be only one active span per thread. I guess the DD Java agent uses a ThreadLocal to store this information.

Unfortunately, in Kotlin there can be multiple coroutines running in the same thread, each of them having its own active span. Continuation logic is processed within the internal Kotlin event loop, where coroutine state is checked. It's also possible that a coroutine started in one thread finishes in another thread, depending on its inner logic (IO reads, for example).

This is a known issue, and the Kotlin team has already discussed how to achieve ThreadLocal-like functionality inside coroutines. The rough idea is that the developer creates a variable in the coroutine scope which is restored on coroutine context change. Take a look at this discussion: https://github.com/Kotlin/kotlinx.coroutines/issues/119

There is a special CoroutineContext element designed to work with ThreadLocal variables and restore them: https://github.com/Kotlin/kotlinx.coroutines/blob/master/kotlinx-coroutines-core/jvm/src/ThreadContextElement.kt

Unfortunately, using these solutions requires adding a dependency on kotlinx-coroutines-core to your project.
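For reference, the stock mechanism looks roughly like this (a sketch using a plain ThreadLocal as a stand-in for the tracer's active-span storage, which the agent does not actually expose this way):

import kotlinx.coroutines.asContextElement
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// A plain ThreadLocal standing in for "the currently active span".
val activeSpanName: ThreadLocal<String?> = ThreadLocal()

fun main(): Unit = runBlocking {
    launch(activeSpanName.asContextElement(value = "parent-span")) {
        // The value is re-installed on whatever thread resumes this coroutine
        // and restored to the thread's previous value whenever it suspends.
        println(activeSpanName.get()) // prints "parent-span"
    }
}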

Please let me know what you think.

Thanks.

ikovalyov avatar Jan 06 '20 08:01 ikovalyov

Hello @ikovalyov, thank you for the example to work on! I have opened a pull request to demonstrate how my company worked around this issue using manual instrumentation. I hope it can be of some use to you/DD.

> Task :run
[dd.trace 2020-02-02 21:46:19:413 +0000] [main] INFO datadog.trace.agent.jmxfetch.JMXFetch - JMXFetch config: null [] [] [] null null {runtime-id=d0ba682d-6be3-4899-bb58-1a891d737499, service=unnamed-java-app} statsd:localhost:8125
[dd.trace 2020-02-02 21:46:19:975 +0000] [main] INFO datadog.trace.agent.ot.DDTraceOTInfo - dd-trace - version: 0.42.0~f78b34fd
[dd.trace 2020-02-02 21:46:20:033 +0000] [main] INFO datadog.trace.agent.ot.DDTracer - New instance: DDTracer-51ec2856{ serviceName=unnamed-java-app, writer=DDAgentWriter { api=DDApi { tracesUrl=http://localhost:8126/v0.3/traces } }, sampler=datadog.trace.agent.common.sampling.RateByServiceSampler@714f3da4, defaultSpanTags={}}
[dd.trace 2020-02-02 21:46:20:056 +0000] [main] INFO datadog.trace.agent.tooling.VersionLogger - dd-trace-ot - version: 0.42.0~f78b34fd
[dd.trace 2020-02-02 21:46:20:056 +0000] [main] INFO datadog.trace.agent.tooling.VersionLogger - dd-trace-api - version: 0.42.0~f78b34fd
[dd.trace 2020-02-02 21:46:20:056 +0000] [main] INFO datadog.trace.agent.tooling.VersionLogger - dd-java-agent - version: 0.42.0~f78b34fd
21:46:21.322  INFO : === INIT === [ t_id=7035779805910583268, s_id=6979527234783672340, p_id=0] main}
21:46:21.464  INFO : ++++ IN +++: 5; [ t_id=7035779805910583268, s_id=3210259951591078075, p_id=6979527234783672340] DefaultDispatcher-worker-6
21:46:21.464  INFO : ++++ IN +++: 3; [ t_id=7035779805910583268, s_id=7267272080838780582, p_id=6979527234783672340] DefaultDispatcher-worker-4
21:46:21.464  INFO : ++++ IN +++: 1; [ t_id=7035779805910583268, s_id=7022831019533687639, p_id=6979527234783672340] DefaultDispatcher-worker-3
21:46:21.464  INFO : ++++ IN +++: 6; [ t_id=7035779805910583268, s_id=3530788139845839964, p_id=6979527234783672340] DefaultDispatcher-worker-7
21:46:21.464  INFO : ++++ IN +++: 9; [ t_id=7035779805910583268, s_id=7568398546412052017, p_id=6979527234783672340] DefaultDispatcher-worker-10
21:46:21.464  INFO : ++++ IN +++: 2; [ t_id=7035779805910583268, s_id=4861729612813258103, p_id=6979527234783672340] DefaultDispatcher-worker-2
21:46:21.464  INFO : ++++ IN +++: 7; [ t_id=7035779805910583268, s_id=5300620202162984080, p_id=6979527234783672340] DefaultDispatcher-worker-11
21:46:21.464  INFO : ++++ IN +++: 4; [ t_id=7035779805910583268, s_id=9080709432457601673, p_id=6979527234783672340] DefaultDispatcher-worker-5
21:46:21.464  INFO : ++++ IN +++: 8; [ t_id=7035779805910583268, s_id=814191239339242963, p_id=6979527234783672340] DefaultDispatcher-worker-9
21:46:21.465  INFO : --- OUT ---: 1; [ t_id=7035779805910583268, s_id=7022831019533687639, p_id=6979527234783672340] DefaultDispatcher-worker-3
21:46:21.464  INFO : ++++ IN +++: 0; [ t_id=7035779805910583268, s_id=5179139853462982352, p_id=6979527234783672340] DefaultDispatcher-worker-1
21:46:21.465  INFO : --- OUT ---: 0; [ t_id=7035779805910583268, s_id=5179139853462982352, p_id=6979527234783672340] DefaultDispatcher-worker-1
21:46:22.506  INFO : --- OUT ---: 2; [ t_id=7035779805910583268, s_id=4861729612813258103, p_id=6979527234783672340] DefaultDispatcher-worker-3
21:46:23.508  INFO : --- OUT ---: 3; [ t_id=7035779805910583268, s_id=7267272080838780582, p_id=6979527234783672340] DefaultDispatcher-worker-3
21:46:24.511  INFO : --- OUT ---: 4; [ t_id=7035779805910583268, s_id=9080709432457601673, p_id=6979527234783672340] DefaultDispatcher-worker-3
21:46:25.513  INFO : --- OUT ---: 5; [ t_id=7035779805910583268, s_id=3210259951591078075, p_id=6979527234783672340] DefaultDispatcher-worker-3
21:46:26.514  INFO : --- OUT ---: 6; [ t_id=7035779805910583268, s_id=3530788139845839964, p_id=6979527234783672340] DefaultDispatcher-worker-3
21:46:27.519  INFO : --- OUT ---: 7; [ t_id=7035779805910583268, s_id=5300620202162984080, p_id=6979527234783672340] DefaultDispatcher-worker-3
21:46:28.522  INFO : --- OUT ---: 8; [ t_id=7035779805910583268, s_id=814191239339242963, p_id=6979527234783672340] DefaultDispatcher-worker-3
21:46:29.526  INFO : --- OUT ---: 9; [ t_id=7035779805910583268, s_id=7568398546412052017, p_id=6979527234783672340] DefaultDispatcher-worker-3
[dd.trace 2020-02-02 21:46:31:083 +0000] [Thread-3] INFO datadog.trace.agent.common.writer.ddagent.TraceSerializingDisruptor - Flushing any remaining traces.

MxKaras avatar Feb 02 '20 21:02 MxKaras

@MxKaras thanks, I'll try this approach!

ikovalyov avatar Feb 04 '20 12:02 ikovalyov

Looking back at it, our initial implementation of the ThreadContextElement was a bit naive. @ikovalyov if it helps -

import io.opentracing.Scope
import io.opentracing.Span
import io.opentracing.util.GlobalTracer
import kotlinx.coroutines.ThreadContextElement
import kotlin.coroutines.CoroutineContext

class SpanContextElement(val span: Span) : ThreadContextElement<Scope> {
    override val key = Key

    // this is invoked before coroutine is resumed on current thread
    override fun updateThreadContext(context: CoroutineContext): Scope {
        return GlobalTracer.get().activateSpan(span)
    }

    // this is invoked after coroutine has suspended on current thread
    override fun restoreThreadContext(context: CoroutineContext, oldState: Scope) {
        oldState.close()
    }

    companion object Key : CoroutineContext.Key<SpanContextElement>
}
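A usage sketch (my assumption of how it would be wired up, not code from the linked PR): pass the element when starting the coroutine so the span is re-activated on every resume:

import io.opentracing.util.GlobalTracer
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

fun main(): Unit = runBlocking {
    val tracer = GlobalTracer.get()
    val parent = tracer.buildSpan("parent").start()
    (0 until 10).map { i ->
        // Each child coroutine carries the parent span in its context, so the tracer
        // sees it as the active span on whichever worker thread resumes the coroutine.
        async(Dispatchers.Default + SpanContextElement(parent)) {
            val child = tracer.buildSpan("child-$i").asChildOf(parent).start()
            delay(100)
            child.finish()
        }
    }.awaitAll()
    parent.finish()
}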

MxKaras avatar Apr 17 '20 16:04 MxKaras

Hi @MxKaras, I approved your MR and added more testing code there. Unfortunately, more issues were revealed :(

Let's say we have two coroutines launched from a parent span. They both have their own span connected to the parent span, both use this SpanContextElement, and both perform some async operations. For some reason, when the first coroutine finishes while the second coroutine is still awaiting, the parent span is restored in the second coroutine instead of the correct one:

15:15:36.319  INFO : --- OUT ---: 9; DDSpan [ t_id=2011767087483520335, s_id=3155681079097456617, p_id=0] trace=Main/trace.annotation/trace.annotation metrics={} tags={component=trace, language=jvm, resource.name=Main.main, runtime-id=febd6773-cf00-4347-9d3c-1f4ed3433742, thread.id=1, thread.name=main}, duration_ns=0
15:15:36.330  INFO : --- OUT ---: 8; DDSpan [ t_id=2011767087483520335, s_id=2642583294064030532, p_id=3155681079097456617] trace=Main/launch 8/launch 8 metrics={} tags={Iteration=8, thread.id=1, thread.name=main}, duration_ns=0
DDSpan [ t_id=2011767087483520335, s_id=2642583294064030532, p_id=3155681079097456617] trace=Main/launch 8/launch 8 metrics={} tags={Iteration=8, thread.id=1, thread.name=main}, duration_ns=8112597105
--> 15:15:36.331  INFO : --- MIDDLE ---: 9; DDSpan [ t_id=2011767087483520335, s_id=3155681079097456617, p_id=0] trace=Main/trace.annotation/trace.annotation metrics={} tags={component=trace, language=jvm, resource.name=Main.main, runtime-id=febd6773-cf00-4347-9d3c-1f4ed3433742, thread.id=1, thread.name=main}, duration_ns=0
DDSpan [ t_id=2011767087483520335, s_id=1805041717008322454, p_id=3155681079097456617] trace=Main/9/9 metrics={} tags={Iteration=9, thread.id=1, thread.name=main}, duration_ns=9114222553
--> 15:15:37.332  INFO : --- OUT ---: 9; DDSpan [ t_id=2011767087483520335, s_id=3155681079097456617, p_id=0] trace=Main/trace.annotation/trace.annotation metrics={} tags={component=trace, language=jvm, resource.name=Main.main, runtime-id=febd6773-cf00-4347-9d3c-1f4ed3433742, thread.id=1, thread.name=main}, duration_ns=0
DDSpan [ t_id=2011767087483520335, s_id=6106499147502182028, p_id=3155681079097456617] trace=Main/launch 9/launch 9 metrics={} tags={Iteration=9, thread.id=1, thread.name=main}, duration_ns=9109520285


Not sure if you had this issue in your production

ikovalyov avatar Aug 14 '20 13:08 ikovalyov

Solved it by adding the -Ddd.trace.scope.depth.limit=0 configuration parameter.
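For anyone else landing here, that flag is a regular JVM system property passed alongside the agent, e.g. (paths and jar names are placeholders):

java -javaagent:/path/to/dd-java-agent.jar \
     -Ddd.trace.scope.depth.limit=0 \
     -jar my-app.jar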

ikovalyov avatar Aug 17 '20 08:08 ikovalyov

Did you try using a continuation interceptor? https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.coroutines/-continuation-interceptor/

peya-lucas-milotich avatar Apr 07 '21 14:04 peya-lucas-milotich

Hi @MxKaras, do you have a working solution for more up-to-date Datadog agent implementations?

tekener avatar Feb 04 '22 12:02 tekener

My apologies @tekener - I did not notice my notifications for a long time! I no longer work at the company I was at when I wrote those messages. But the solution above was good enough for our needs (most of the time... there were a few services where the traces didn't work properly; I never found out why). I think maybe we were never creating nested spans in the way that @ikovalyov was.

Before I left that job I was looking at OpenTelemetry. There seemed to be some promising extensions for coroutine support, but I never got round to playing with it 🤷‍♂️. Sorry I couldn't be more helpful!

MxKaras avatar Feb 15 '22 18:02 MxKaras

I wouldn't recommend using the sample provided here: https://github.com/DataDog/dd-trace-java/issues/1123#issuecomment-615354579, as it would cause errors/memory leaks.

Here's an example:

import datadog.opentracing.DDTracer.DDTracerBuilder
import io.opentracing.Scope
import io.opentracing.Span
import io.opentracing.util.GlobalTracer
import kotlinx.coroutines.ThreadContextElement
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.time.delay
import org.junit.jupiter.api.BeforeEach
import org.junit.jupiter.api.Test
import org.slf4j.LoggerFactory
import java.time.Duration
import kotlin.coroutines.CoroutineContext

internal class TracerExtensionsKtTest {

    private val logger = LoggerFactory.getLogger("testLogger")

    @BeforeEach
    internal fun setUp() {
        DDTracerBuilder()
            .build()
            .also(GlobalTracer::registerIfAbsent)
    }

    class SpanContextElement(val span: Span) : ThreadContextElement<Scope> {
        override val key = Key

        // this is invoked before coroutine is resumed on current thread
        override fun updateThreadContext(context: CoroutineContext): Scope {
            return GlobalTracer.get().activateSpan(span)
        }

        // this is invoked after coroutine has suspended on current thread
        override fun restoreThreadContext(context: CoroutineContext, oldState: Scope) {
            oldState.close()
        }

        companion object Key : CoroutineContext.Key<SpanContextElement>
    }

    @Test
    internal fun test() {

        val topLevelSpan = GlobalTracer.get().buildSpan("top-level-span").start()
        val topLevelScope = GlobalTracer.get().activateSpan(topLevelSpan)
        topLevelScope.use {

            runBlocking(SpanContextElement(topLevelSpan)) {
                val uninterruptedSpan = GlobalTracer.get().buildSpan("uninterrupted-span").start()
                val uninterruptedScope = GlobalTracer.get().activateSpan(uninterruptedSpan)
                uninterruptedScope.use {
                    logger.info("Hello from uninterrupted span!")
                    uninterruptedSpan.finish()
                }

                val interruptedSpan = GlobalTracer.get().buildSpan("interrupted-span").start()
                val interruptedScope = GlobalTracer.get().activateSpan(interruptedSpan)
                interruptedScope.use {
                    logger.info("Hello from interrupted span!")
                    delay(Duration.ofMillis(100))
                    interruptedSpan.finish()
                }
            }

            topLevelSpan.finish()
        }

    }
}

The coroutine will get parked here when we call delay(), leading to a call to SpanContextElement#restoreThreadContext while interruptedScope is still in use (not closed). When the coroutine continues after the delay, SpanContextElement#updateThreadContext will be called, creating a new scope. This might cause memory leaks, since the stack to which interruptedScope belongs might never get cleaned up. As a side effect you will also see the following log messages:

14:50:58.803 [main] DEBUG d.t.c.s.ContinuableScopeManager - Tried to close datadog.trace.core.scopemanager.ContinuableScopeManager$ContinuableScope@37efd131->DDSpan [ t_id=3440430157635651782, s_id=154419290588238380, p_id=0 ] trace=com.intellij.rt.junit.JUnitStarter/top-level-span/top-level-span metrics= tags={language=jvm, runtime-id=1c59fbd5-69df-42bf-a07f-bcfe2d9fb227, thread.id=1, thread.name=main}, duration_ns=0 scope when not on top.  Current top: datadog.trace.core.scopemanager.ContinuableScopeManager$ContinuableScope@7e7b159b->DDSpan [ t_id=3440430157635651782, s_id=1361391474773844142, p_id=154419290588238380 ] trace=com.intellij.rt.junit.JUnitStarter/interrupted-span/interrupted-span metrics= tags={thread.id=1, thread.name=main @coroutine#1}, duration_ns=0 | MDC: []
14:50:58.903 [main @coroutine#1] DEBUG d.t.c.s.ContinuableScopeManager - Tried to close datadog.trace.core.scopemanager.ContinuableScopeManager$ContinuableScope@7e7b159b->DDSpan [ t_id=3440430157635651782, s_id=1361391474773844142, p_id=154419290588238380 ] trace=com.intellij.rt.junit.JUnitStarter/interrupted-span/interrupted-span metrics= tags={thread.id=1, thread.name=main @coroutine#1}, duration_ns=102527964 scope when not on top.  Current top: datadog.trace.core.scopemanager.ContinuableScopeManager$ContinuableScope@293a5f75->DDSpan [ t_id=3440430157635651782, s_id=154419290588238380, p_id=0 ] trace=com.intellij.rt.junit.JUnitStarter/top-level-span/top-level-span metrics= tags={language=jvm, runtime-id=1c59fbd5-69df-42bf-a07f-bcfe2d9fb227, thread.id=1, thread.name=main}, duration_ns=0 | MDC: []
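In other words, the ordering around the delay() is roughly the following (my reading of the example above, matching the two warnings):

// 1. activateSpan(interruptedSpan)   -> interruptedScope is now on top of the scope stack
// 2. delay() suspends the coroutine  -> restoreThreadContext closes the scope holding
//                                       topLevelSpan, which is NOT on top -> warning #1
// 3. the coroutine resumes           -> updateThreadContext opens a NEW scope for
//                                       topLevelSpan on top of interruptedScope
// 4. interruptedScope is closed      -> it is no longer on top -> warning #2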

Also, you will notice the scope.close.error counter being incremented.

See DD source for more details: https://github.com/DataDog/dd-trace-java/blob/90a648fd19d3c3114e29957b11147332c1d30cb9/dd-trace-core/src/main/java/datadog/trace/core/scopemanager/ContinuableScopeManager.java#L272

monosoul avatar Apr 29 '22 13:04 monosoul

@monosoul Can you please share the solution to overcome the above limitation?

vsingal-p avatar Dec 03 '22 19:12 vsingal-p

:robot: This issue has been addressed in the latest release. See full details in the Release Notes.

github-actions[bot] avatar Feb 08 '23 00:02 github-actions[bot]