MDC propagation sometimes disrupted by Netty event loop threads
Per the discussion on the possible cause of occasional loss of MDC propagation, it seems that NettyServer should be passing an Executor to Netty:
https://github.com/playframework/playframework/blob/92ffcbda314cbb05a454a72fc99e9cbc92b3579f/framework/src/play-netty-server/src/main/scala/play/core/server/NettyServer.scala#L77-L78
Play Version (2.5.x / etc)
2.5.15
API (Scala / Java / Neither / Both)
Scala
Operating System (Ubuntu 15.10 / MacOS 10.10 / Windows 10)
Linux 770944747-jsngb 4.4.35+ #1 SMP Wed Apr 5 13:00:57 PDT 2017 x86_64 GNU/Linux
JDK (Oracle 1.8.0_72, OpenJDK 1.8.x, Azul Zing)
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2~bpo8+1-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
Expected Behavior
- Logback is configured to include the MDC
- a Play Filter sets a requestId in the MDC
- all log statements should include the MDC
Actual Behavior
- There are occasional cases where the MDC is improperly empty.
Reproducible Test Case
I do not have a reproducible test case at this point. The log statements below, copied from the thread linked above, illustrate the issue.
```scala
override def prepare(): ExecutionContext = new ExecutionContext {
  // capture the MDC
  val mdcContext = MDC.getCopyOfContextMap
  val id = UUID.randomUUID().toString
  logger.debug(s"${id} mdcContext=${mdcContext}")

  def execute(r: Runnable) = self.execute(new Runnable {
    def run() = {
      // backup the callee MDC context
      val oldMDCContext = MDC.getCopyOfContextMap
      logger.debug(s"${id} oldMDCContext=${oldMDCContext}")
      // Run the runnable with the captured context
      setContextMap(mdcContext)
      logger.debug(s"${id} contextMap set")
      try {
        r.run()
      } finally {
        // restore the callee MDC context
        logger.debug(s"${id} Restoring to ${oldMDCContext}")
        setContextMap(oldMDCContext)
        logger.debug(s"${id} Restored")
      }
    }
  })

  def reportFailure(t: Throwable) = self.reportFailure(t)
}
```
In rare cases, the MDC is lost by the time the controller executes.
```
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.b.u.HttpContextFilter:29 requestId set to KZ7iVA0XTR2/rHFoelN9sw
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:67 e839a3d3-f868-4130-9ce1-f81fa64a73fb mdcContext={requestId=KZ7iVA0XTR2/rHFoelN9sw}
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:67 2e89056b-9773-4587-81d6-4241e4872622 mdcContext={requestId=KZ7iVA0XTR2/rHFoelN9sw}
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:80 cd5cebcc-71df-431b-8707-bbc5261c1731 Restoring to null
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [] c.e.n.a.MDCPropagatingDispatcher:82 cd5cebcc-71df-431b-8707-bbc5261c1731 Restored
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1125] [] c.e.n.a.MDCPropagatingDispatcher:72 2e89056b-9773-4587-81d6-4241e4872622 oldMDCContext=null
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1125] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:75 2e89056b-9773-4587-81d6-4241e4872622 contextMap set
2017-06-13 06:56:44,027 [DEBUG] [application-akka.actor.default-dispatcher-1125] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:80 2e89056b-9773-4587-81d6-4241e4872622 Restoring to null
2017-06-13 06:56:44,027 [DEBUG] [application-akka.actor.default-dispatcher-1125] [] c.e.n.a.MDCPropagatingDispatcher:82 2e89056b-9773-4587-81d6-4241e4872622 Restored
2017-06-13 06:56:44,027 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 d8c74914-7524-4c2c-be33-7155d08487b8 mdcContext=null
2017-06-13 06:56:44,027 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 2e6bc121-5058-4230-ab9a-c39dd1e75568 mdcContext=null
2017-06-13 06:56:44,027 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 0ca6968c-cd06-4850-86c5-9114ec6b495f mdcContext=null
2017-06-13 06:56:44,028 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 0a8ac5a3-7cba-4bdd-bae1-386140a580d2 mdcContext=null
2017-06-13 06:56:44,028 [DEBUG] [application-akka.actor.default-dispatcher-1132] [] c.e.n.a.MDCPropagatingDispatcher:72 2e6bc121-5058-4230-ab9a-c39dd1e75568 oldMDCContext=null
2017-06-13 06:56:44,028 [DEBUG] [application-akka.actor.default-dispatcher-1132] [] c.e.n.a.MDCPropagatingDispatcher:75 2e6bc121-5058-4230-ab9a-c39dd1e75568 contextMap set
2017-06-13 06:56:44,028 [INFO ] [application-akka.actor.default-dispatcher-1132] [] c.e.n.b.c.Controller:24 ***MDC IS NULL HERE***
```
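The failure mode visible in the logs is that `prepare()` runs on `netty-event-loop-13`, where the MDC is empty, so the captured map is null and the wrapped runnable then clears the MDC on the dispatcher thread before the controller runs. A minimal, self-contained sketch of that mechanism, using a plain `ThreadLocal` to stand in for the MDC (all names here are illustrative, not Play or Netty APIs):

```scala
import java.util.concurrent.{CountDownLatch, Executors}
import java.util.concurrent.atomic.AtomicReference

// A plain ThreadLocal stands in for the MDC in this sketch.
val mdc = new ThreadLocal[String]

// Mimics prepare(): eagerly captures the submitting thread's context.
def prepared(): Runnable => Runnable = {
  val captured = mdc.get() // null if prepare() runs on a context-free thread
  r => new Runnable {
    def run(): Unit = {
      val old = mdc.get()
      mdc.set(captured)
      try r.run() finally mdc.set(old)
    }
  }
}

val pool = Executors.newSingleThreadExecutor()
val seen = new AtomicReference[String]("unset")
val done = new CountDownLatch(1)

// The "netty-event-loop" thread has no MDC, yet it is the one calling prepare().
val nettyThread = new Thread(new Runnable {
  def run(): Unit = {
    val wrap = prepared() // captures null instead of the request's context
    pool.execute(wrap(new Runnable {
      def run(): Unit = { seen.set(mdc.get()); done.countDown() }
    }))
  }
})

mdc.set("requestId=KZ7iVA0XTR2") // the request's context lives on the main thread only
nettyThread.start()
done.await()
pool.shutdown()
println(s"context seen in the task: ${seen.get}") // null -- the context was lost
```

The captured-null case here corresponds to the `mdcContext=null` lines logged on `netty-event-loop-13` above.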
@jsw Do you have a PR you can submit? We're planning for 2.6.0 GA, so we don't have a lot of time free.
@wsargent Happy to give it a shot.
I'm not sure what the appropriate Executor to pass to Netty is. I see that one can be derived from either the ActorSystem or Materializer constructor parameters, though it's not clear that Netty should be using the application ExecutionContext or a different one.
I'm also not sure how this will affect the underlying issue with MDC propagation. Based on the log statements above, it seems the netty-event-loop threads are flowing through the MDCPropagatingDispatcher, perhaps inappropriately; I thought only the Akka threads were supposed to do that. I can gather another clue by logging the stack trace whenever a thread named netty-event-loop* flows through the MDCPropagatingDispatcher.
Verifying the solution fixes the issue will be a bit challenging as I'll need to migrate to 2.6.
Whether it does or not, if the expected code change is obvious, I'm happy to work on it.
Upon further investigation, it makes sense that a netty-event-loop thread will enter MDCPropagatingDispatcher.prepare here, handing off to an akka thread in MDCPropagatingDispatcher.run.
Given that, it is clear why the MDC propagation is interrupted. Unless there is something I'm missing, Netty would have to run in the same ExecutionContext as the rest of the app, but I'm not sure even that would help, since Netty is Java and therefore never calls ExecutionContext.prepare.
Tagging @yanns, who may have additional insight.
Yeah, there needs to be a "flow of control" switch between Netty (server side) and Akka (user / app flow) internally. It's actually a sticking point as it impacts performance.
For info, this is the code I am now using:
```scala
import org.slf4j.MDC
import scala.concurrent.ExecutionContext

/**
 * Propagates the Logback MDC in future callbacks.
 */
trait MDCPropagatorExecutionContext extends ExecutionContext { self =>

  override def prepare(): ExecutionContext = new ExecutionContext {
    // capture the MDC
    val mdcContext = MDC.getCopyOfContextMap

    def execute(r: Runnable) = self.execute(new Runnable {
      def run() = {
        // backup the callee MDC context
        val oldMDCContext = MDC.getCopyOfContextMap
        // Run the runnable with the captured context
        setContextMap(mdcContext)
        try {
          r.run()
        } finally {
          // restore the callee MDC context
          setContextMap(oldMDCContext)
        }
      }
    })

    def reportFailure(t: Throwable) = self.reportFailure(t)
  }

  private[this] def setContextMap(context: java.util.Map[String, String]): Unit = {
    if (context == null) {
      MDC.clear()
    } else {
      MDC.setContextMap(context)
    }
  }
}

object MDCPropagatorExecutionContext {
  def apply(delegate: ExecutionContext): MDCPropagatorExecutionContext =
    new ExecutionContext with MDCPropagatorExecutionContext {
      override def reportFailure(cause: Throwable): Unit = delegate.reportFailure(cause)
      override def execute(runnable: Runnable): Unit = delegate.execute(runnable)
    }
}
```
This way, I can mix this in any Executor.
And for Akka, I can use a dispatcher:
```scala
import java.util.concurrent.TimeUnit

import akka.dispatch.{Dispatcher, DispatcherPrerequisites, MessageDispatcher, MessageDispatcherConfigurator}
import com.typesafe.config.Config
import scala.concurrent.duration.FiniteDuration

class MDCPropagatingDispatcherConfigurator(config: Config, prerequisites: DispatcherPrerequisites)
  extends MessageDispatcherConfigurator(config, prerequisites) {

  private val instance = new Dispatcher(
    this,
    config.getString("id"),
    config.getInt("throughput"),
    FiniteDuration(config.getDuration("throughput-deadline-time", TimeUnit.NANOSECONDS), TimeUnit.NANOSECONDS),
    configureExecutor(),
    FiniteDuration(config.getDuration("shutdown-timeout", TimeUnit.MILLISECONDS), TimeUnit.MILLISECONDS)
  ) with MDCPropagatorExecutionContext

  override def dispatcher(): MessageDispatcher = instance
}
```
It's not a must that Netty uses the same ExecutionContext as the application. They can use different ExecutionContexts, as long as they all mix in MDCPropagatorExecutionContext.
@yanns Thanks for sharing your updated implementation. Looks good.
Since Netty is Java and uses Executor instead of ExecutionContext, it won't call ExecutionContext.prepare(), so how will that work?
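One way a Java caller can still get propagation without prepare() is to capture at execute() time instead: wrap the Executor handed to Netty so that every submission captures the submitter's context. A self-contained sketch of that idea (a ThreadLocal stands in for the MDC, and MdcCapturingExecutor is an illustrative name, not a Play or Netty API):

```scala
import java.util.concurrent.{CountDownLatch, Executor, Executors}
import java.util.concurrent.atomic.AtomicReference

// Stand-in for the MDC in this sketch.
val mdc = new ThreadLocal[String]

// Captures the submitter's context at execute() time. No prepare() is involved,
// so this works even when the caller is plain Java code such as Netty.
class MdcCapturingExecutor(delegate: Executor) extends Executor {
  def execute(r: Runnable): Unit = {
    val captured = mdc.get()
    delegate.execute(new Runnable {
      def run(): Unit = {
        val old = mdc.get()
        mdc.set(captured)
        try r.run() finally mdc.set(old)
      }
    })
  }
}

val pool = Executors.newSingleThreadExecutor()
val wrapped = new MdcCapturingExecutor(pool)
val seen = new AtomicReference[String]
val done = new CountDownLatch(1)

mdc.set("requestId=abc") // context present on the submitting thread
wrapped.execute(new Runnable {
  def run(): Unit = { seen.set(mdc.get()); done.countDown() }
})
done.await()
pool.shutdown()
println(seen.get) // requestId=abc
```

The key difference from the prepare()-based approach is when the capture happens: here it is at the execute() call site, which is always on the thread that still holds the context.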
Also see https://github.com/playframework/playframework/issues/2581
Note that ExecutionContext.prepare has been deprecated in Scala 2.12. The reason is that it required every method taking an ExecutionContext argument to prepare it immediately and never use the implicit unprepared ExecutionContext, which was very error-prone.
No good alternative is currently available, unfortunately. I've explored a couple of ways of doing it, but nothing is currently ready. :(
Funny, I was literally just reading a blog post moments ago describing the same thing: http://viktorklang.com/blog/Futures-in-Scala-2.12-part-7.html
I haven't looked into the details yet, but the Twitter Futures propagate context across threads. I wonder why that aspect wasn't adopted into Scala Futures. https://twitter.github.io/finagle/guide/Contexts.html
I think we looked at that implementation when deprecation of prepare was originally discussed: https://github.com/scala/scala/pull/4042#issuecomment-58509573. Some of the discussion happened on another PR, which seems to have been deleted (@viktorklang?). A whole redesign is probably needed to support propagating contexts.
I also remember one PR from @viktorklang where he proposed some changes, and where we discussed why we need prepare for context propagation. (I cannot find this PR anymore - it was maybe the deleted https://github.com/viktorklang/scala/issues/7)
I also understand that this prepare method is so easy to forget/difficult to use correctly that he would like to remove it in the future.
It means that we do not really have a good solution (yet) for context propagation in the future.
@yanns @richdougherty There's no current proposal to replace prepare; the problem with it is well understood. It's just too hard to use correctly and be sure about the effects.
So this is discussed in the highlights a bit: you can pass around a context implicitly through your code and then rely on the MarkerContext to provide that information for you:
https://github.com/playframework/play-scala-chatroom-example/blob/2.6.x/app/controllers/RequestMarkerContext.scala
You could probably use a context bound and save on the implicit as well:
```scala
private def doStuff[A: Request]() = {
  logger.info("takes an implicit Request[A] and uses RequestMarkerContext to autoconvert")
}
```
Yes, MarkerContext is a good solution, but it is limited to user code. If any library does the logging, it no longer works.
One can use MarkerContext for the application code, and set the MDC context when calling library code.
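That hand-off to library code can be wrapped in a small scoped helper that installs the context before the call and always restores it afterwards. A sketch, with a ThreadLocal standing in for org.slf4j.MDC so it is self-contained (Mdc and withMDC are illustrative names):

```scala
// Stand-in for org.slf4j.MDC so the sketch is self-contained.
object Mdc {
  private val tl = new ThreadLocal[Map[String, String]] {
    override def initialValue: Map[String, String] = Map.empty
  }
  def get: Map[String, String] = tl.get
  def set(m: Map[String, String]): Unit = tl.set(m)
}

// Installs `context` for the duration of `body`, restoring the old map even on failure.
def withMDC[A](context: Map[String, String])(body: => A): A = {
  val old = Mdc.get
  Mdc.set(context)
  try body finally Mdc.set(old)
}

val inside = withMDC(Map("requestId" -> "abc")) {
  // any library code logging here would see requestId=abc
  Mdc.get("requestId")
}
println(inside)          // abc
println(Mdc.get.isEmpty) // true -- the previous context was restored
```

With the real MDC, the body of withMDC would call MDC.setContextMap / MDC.clear instead, as in the setContextMap helper shown earlier in the thread.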
Does anyone know if this is still an issue on version 2.6.x when using Akka HTTP server backend?
I'm going to mark this as a defect because I think we need to come up with a solution to this kind of problem.
Link to discussion I started on this topic. https://contributors.scala-lang.org/t/locals-in-scala-concurrent/1785/4
Lightbend Telemetry claims to propagate MDC properly - has anyone used this?
https://developer.lightbend.com/docs/cinnamon/current/extensions/mdc.html
On second glance, Telemetry requires a license. Another possibility is Akka's DiagnosticLoggingAdapter:
https://doc.akka.io/docs/akka/2.5/logging.html#mdc-values-defined-by-the-application
There is only one way to make it work: aspects. I'm almost sure that's how Lightbend has implemented it. Generally, this is not as complex as it seems; the Kamon logback project could serve as an example, or maybe even as a solution: https://github.com/kamon-io/kamon-logback
Any update on this? I'm a Java developer, and finding a solution is a nightmare; I managed to translate many Scala workarounds, but nothing seems to work at runtime.
Is there any intention to address this in its current state?
On Scala, even when providing an ExecutionContext whose prepare copies the MDC across, I have seen the MDC lost in Filters and in Actions whenever the execution is picked up by a different thread.
It can be made to work reliably with the following changes. Is the undoing of the Accumulator Future optimisation the performance impact that is mentioned above?