MDC propagation sometimes disrupted by Netty event loop threads
Per the discussion on the possible cause of occasional loss of MDC propagation, it seems that NettyServer should be passing an Executor to Netty:
https://github.com/playframework/playframework/blob/92ffcbda314cbb05a454a72fc99e9cbc92b3579f/framework/src/play-netty-server/src/main/scala/play/core/server/NettyServer.scala#L77-L78
Play Version (2.5.x / etc)
2.5.15
API (Scala / Java / Neither / Both)
Scala
Operating System (Ubuntu 15.10 / MacOS 10.10 / Windows 10)
Linux 770944747-jsngb 4.4.35+ #1 SMP Wed Apr 5 13:00:57 PDT 2017 x86_64 GNU/Linux
JDK (Oracle 1.8.0_72, OpenJDK 1.8.x, Azul Zing)
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2~bpo8+1-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
Expected Behavior
- Logback is configured to include the MDC
- a Play Filter sets a requestId in the MDC
- all log statements should include the MDC
Actual Behavior
- There are occasional cases where the MDC is improperly empty.
Reproducible Test Case
I do not have a reproducible test case at this point. The log statements below, copied from the thread linked above, illustrate the issue.
```scala
override def prepare(): ExecutionContext = new ExecutionContext {
  // capture the MDC
  val mdcContext = MDC.getCopyOfContextMap
  val id = UUID.randomUUID().toString
  logger.debug(s"${id} mdcContext=${mdcContext}")

  def execute(r: Runnable) = self.execute(new Runnable {
    def run() = {
      // backup the callee MDC context
      val oldMDCContext = MDC.getCopyOfContextMap
      logger.debug(s"${id} oldMDCContext=${oldMDCContext}")
      // Run the runnable with the captured context
      setContextMap(mdcContext)
      logger.debug(s"${id} contextMap set")
      try {
        r.run()
      } finally {
        // restore the callee MDC context
        logger.debug(s"${id} Restoring to ${oldMDCContext}")
        setContextMap(oldMDCContext)
        logger.debug(s"${id} Restored")
      }
    }
  })

  def reportFailure(t: Throwable) = self.reportFailure(t)
}
```
In rare cases, the MDC is lost by the time the controller executes.
```
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.b.u.HttpContextFilter:29 requestId set to KZ7iVA0XTR2/rHFoelN9sw
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:67 e839a3d3-f868-4130-9ce1-f81fa64a73fb mdcContext={requestId=KZ7iVA0XTR2/rHFoelN9sw}
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:67 2e89056b-9773-4587-81d6-4241e4872622 mdcContext={requestId=KZ7iVA0XTR2/rHFoelN9sw}
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:80 cd5cebcc-71df-431b-8707-bbc5261c1731 Restoring to null
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1132] [] c.e.n.a.MDCPropagatingDispatcher:82 cd5cebcc-71df-431b-8707-bbc5261c1731 Restored
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1125] [] c.e.n.a.MDCPropagatingDispatcher:72 2e89056b-9773-4587-81d6-4241e4872622 oldMDCContext=null
2017-06-13 06:56:44,026 [DEBUG] [application-akka.actor.default-dispatcher-1125] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:75 2e89056b-9773-4587-81d6-4241e4872622 contextMap set
2017-06-13 06:56:44,027 [DEBUG] [application-akka.actor.default-dispatcher-1125] [requestId=KZ7iVA0XTR2/rHFoelN9sw] c.e.n.a.MDCPropagatingDispatcher:80 2e89056b-9773-4587-81d6-4241e4872622 Restoring to null
2017-06-13 06:56:44,027 [DEBUG] [application-akka.actor.default-dispatcher-1125] [] c.e.n.a.MDCPropagatingDispatcher:82 2e89056b-9773-4587-81d6-4241e4872622 Restored
2017-06-13 06:56:44,027 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 d8c74914-7524-4c2c-be33-7155d08487b8 mdcContext=null
2017-06-13 06:56:44,027 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 2e6bc121-5058-4230-ab9a-c39dd1e75568 mdcContext=null
2017-06-13 06:56:44,027 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 0ca6968c-cd06-4850-86c5-9114ec6b495f mdcContext=null
2017-06-13 06:56:44,028 [DEBUG] [netty-event-loop-13] [] c.e.n.a.MDCPropagatingDispatcher:67 0a8ac5a3-7cba-4bdd-bae1-386140a580d2 mdcContext=null
2017-06-13 06:56:44,028 [DEBUG] [application-akka.actor.default-dispatcher-1132] [] c.e.n.a.MDCPropagatingDispatcher:72 2e6bc121-5058-4230-ab9a-c39dd1e75568 oldMDCContext=null
2017-06-13 06:56:44,028 [DEBUG] [application-akka.actor.default-dispatcher-1132] [] c.e.n.a.MDCPropagatingDispatcher:75 2e6bc121-5058-4230-ab9a-c39dd1e75568 contextMap set
2017-06-13 06:56:44,028 [INFO ] [application-akka.actor.default-dispatcher-1132] [] c.e.n.b.c.Controller:24 ***MDC IS NULL HERE***
```
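The failure mode visible in the logs is that `prepare()` runs on `netty-event-loop-13`, where the MDC is empty, so the captured map is null and the wrapped runnable then clears the MDC on the dispatcher thread before the controller runs. A minimal, self-contained sketch of that mechanism, using a plain `ThreadLocal` to stand in for the MDC (all names here are illustrative, not Play or Netty APIs):

```scala
import java.util.concurrent.{CountDownLatch, Executors}
import java.util.concurrent.atomic.AtomicReference

// A plain ThreadLocal stands in for the MDC in this sketch.
val mdc = new ThreadLocal[String]

// Mimics prepare(): eagerly captures the submitting thread's context.
def prepared(): Runnable => Runnable = {
  val captured = mdc.get() // null if prepare() runs on a context-free thread
  r => new Runnable {
    def run(): Unit = {
      val old = mdc.get()
      mdc.set(captured)
      try r.run() finally mdc.set(old)
    }
  }
}

val pool = Executors.newSingleThreadExecutor()
val seen = new AtomicReference[String]("unset")
val done = new CountDownLatch(1)

// The "netty-event-loop" thread has no MDC, yet it is the one calling prepare().
val nettyThread = new Thread(new Runnable {
  def run(): Unit = {
    val wrap = prepared() // captures null instead of the request's context
    pool.execute(wrap(new Runnable {
      def run(): Unit = { seen.set(mdc.get()); done.countDown() }
    }))
  }
})

mdc.set("requestId=KZ7iVA0XTR2") // the request's context lives on the main thread only
nettyThread.start()
done.await()
pool.shutdown()
println(s"context seen in the task: ${seen.get}") // null -- the context was lost
```

The captured-null case here corresponds to the `mdcContext=null` lines logged on `netty-event-loop-13` above.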
@jsw Do you have a PR you can submit? We're planning for 2.6.0 GA, so we don't have a lot of time free.
@wsargent Happy to give it a shot.
I'm not sure what the appropriate Executor to pass to Netty is. I see that one can be derived from either the ActorSystem or Materializer constructor parameters, though it's not clear that Netty should be using the application ExecutionContext or a different one.
I'm also not sure how this will affect the underlying issue with MDC propagation. Based on the log statements above, it seems the netty-event-loop threads are flowing through the MDCPropagatingDispatcher, perhaps inappropriately; I thought only the Akka threads were supposed to do that. I can gather another clue by logging the stack trace whenever a thread named netty-event-loop* flows through the MDCPropagatingDispatcher.
Verifying the solution fixes the issue will be a bit challenging as I'll need to migrate to 2.6.
Whether it does or not, if the expected code change is obvious, I'm happy to work on it.
Upon further investigation, it makes sense that a netty-event-loop thread will enter MDCPropagatingDispatcher.prepare here, handing off to an akka thread in MDCPropagatingDispatcher.run.
Given that, it is clear why the MDC propagation is interrupted. Unless there is something I'm missing, Netty would have to run in the same ExecutionContext as the rest of the app, but I'm not sure even that would help, since Netty is Java and therefore never calls ExecutionContext.prepare.
Tagging @yanns, who may have additional insight.
Yeah, there needs to be a "flow of control" switch between Netty (server side) and Akka (user / app flow) internally. It's actually a sticking point as it impacts performance.
For info, this is the code I am now using:
```scala
import org.slf4j.MDC
import scala.concurrent.ExecutionContext

/**
 * Propagates the Logback MDC in future callbacks.
 */
trait MDCPropagatorExecutionContext extends ExecutionContext { self =>

  override def prepare(): ExecutionContext = new ExecutionContext {
    // capture the MDC
    val mdcContext = MDC.getCopyOfContextMap

    def execute(r: Runnable) = self.execute(new Runnable {
      def run() = {
        // backup the callee MDC context
        val oldMDCContext = MDC.getCopyOfContextMap
        // Run the runnable with the captured context
        setContextMap(mdcContext)
        try {
          r.run()
        } finally {
          // restore the callee MDC context
          setContextMap(oldMDCContext)
        }
      }
    })

    def reportFailure(t: Throwable) = self.reportFailure(t)
  }

  private[this] def setContextMap(context: java.util.Map[String, String]): Unit = {
    if (context == null) {
      MDC.clear()
    } else {
      MDC.setContextMap(context)
    }
  }
}

object MDCPropagatorExecutionContext {
  def apply(delegate: ExecutionContext): MDCPropagatorExecutionContext =
    new ExecutionContext with MDCPropagatorExecutionContext {
      override def reportFailure(cause: Throwable): Unit = delegate.reportFailure(cause)
      override def execute(runnable: Runnable): Unit = delegate.execute(runnable)
    }
}
```
This way, I can mix this in any Executor.
And for Akka, I can use a dispatcher:
```scala
import java.util.concurrent.TimeUnit

import akka.dispatch.{Dispatcher, DispatcherPrerequisites, MessageDispatcher, MessageDispatcherConfigurator}
import com.typesafe.config.Config
import scala.concurrent.duration.FiniteDuration

class MDCPropagatingDispatcherConfigurator(config: Config, prerequisites: DispatcherPrerequisites)
  extends MessageDispatcherConfigurator(config, prerequisites) {

  private val instance = new Dispatcher(
    this,
    config.getString("id"),
    config.getInt("throughput"),
    FiniteDuration(config.getDuration("throughput-deadline-time", TimeUnit.NANOSECONDS), TimeUnit.NANOSECONDS),
    configureExecutor(),
    FiniteDuration(config.getDuration("shutdown-timeout", TimeUnit.MILLISECONDS), TimeUnit.MILLISECONDS)
  ) with MDCPropagatorExecutionContext

  override def dispatcher(): MessageDispatcher = instance
}
```
It's not a must that Netty uses the same ExecutionContext as the application. They can use different ExecutionContexts, as long as they all mix in MDCPropagatorExecutionContext.
@yanns Thanks for sharing your updated implementation. Looks good.
Since Netty is Java and uses Executor instead of ExecutionContext, it won't call ExecutionContext.prepare(), so how will that work?
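One way a Java caller can still get propagation without prepare() is to capture at execute() time instead: wrap the Executor handed to Netty so that every submission captures the submitter's context. A self-contained sketch of that idea (a ThreadLocal stands in for the MDC, and MdcCapturingExecutor is an illustrative name, not a Play or Netty API):

```scala
import java.util.concurrent.{CountDownLatch, Executor, Executors}
import java.util.concurrent.atomic.AtomicReference

// Stand-in for the MDC in this sketch.
val mdc = new ThreadLocal[String]

// Captures the submitter's context at execute() time. No prepare() is involved,
// so this works even when the caller is plain Java code such as Netty.
class MdcCapturingExecutor(delegate: Executor) extends Executor {
  def execute(r: Runnable): Unit = {
    val captured = mdc.get()
    delegate.execute(new Runnable {
      def run(): Unit = {
        val old = mdc.get()
        mdc.set(captured)
        try r.run() finally mdc.set(old)
      }
    })
  }
}

val pool = Executors.newSingleThreadExecutor()
val wrapped = new MdcCapturingExecutor(pool)
val seen = new AtomicReference[String]
val done = new CountDownLatch(1)

mdc.set("requestId=abc") // context present on the submitting thread
wrapped.execute(new Runnable {
  def run(): Unit = { seen.set(mdc.get()); done.countDown() }
})
done.await()
pool.shutdown()
println(seen.get) // requestId=abc
```

The key difference from the prepare()-based approach is when the capture happens: here it is at the execute() call site, which is always on the thread that still holds the context.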
Also see https://github.com/playframework/playframework/issues/2581
Note that ExecutionContext.prepare has been deprecated in Scala 2.12. The reason is that it required every method taking an ExecutionContext argument to prepare it immediately and never use the implicit unprepared ExecutionContext, which was very error-prone.
No good alternative is currently available, unfortunately. I've explored a couple of ways of doing it, but nothing is currently ready. :(
Funny, I was literally just reading a blog post moments ago describing the same thing: http://viktorklang.com/blog/Futures-in-Scala-2.12-part-7.html
I haven't looked into the details yet, but the Twitter Futures propagate context across threads. I wonder why that aspect wasn't adopted into Scala Futures. https://twitter.github.io/finagle/guide/Contexts.html
I think we looked at that implementation when deprecation of prepare was originally discussed: https://github.com/scala/scala/pull/4042#issuecomment-58509573. Some of the discussion happened on another PR, which seems to have been deleted (@viktorklang?). A whole redesign is probably needed to support propagating contexts.
I also remember one PR from @viktorklang where he proposed some changes, and where we discussed why we need prepare for context propagation. (I cannot find this PR anymore - it was maybe the deleted https://github.com/viktorklang/scala/issues/7)
I also understand that this prepare method is so easy to forget/difficult to use correctly that he would like to remove it in the future.
It means that we do not really have a good solution (yet) for context propagation in the future.
@yanns @richdougherty There's no current proposal to replace prepare; the problem with it is well understood. It's just too hard to use correctly and be sure about the effects.
So this is discussed in the highlights a bit: you can pass around a context implicitly through your code and then rely on the MarkerContext to provide that information for you:
https://github.com/playframework/play-scala-chatroom-example/blob/2.6.x/app/controllers/RequestMarkerContext.scala
You could probably use a context bound and save on the implicit as well:
```scala
private def doStuff[A: Request]() = {
  logger.info("takes an implicit Request[A] and uses RequestMarkerContext to autoconvert")
}
```
Yes, MarkerContext is a good solution, but it is limited to user code. If any library does the logging, it no longer works.
One can use MarkerContext for the application code, and set the MDC context when calling library code.
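That hand-off to library code can be wrapped in a small scoped helper that installs the context before the call and always restores it afterwards. A sketch, with a ThreadLocal standing in for org.slf4j.MDC so it is self-contained (Mdc and withMDC are illustrative names):

```scala
// Stand-in for org.slf4j.MDC so the sketch is self-contained.
object Mdc {
  private val tl = new ThreadLocal[Map[String, String]] {
    override def initialValue: Map[String, String] = Map.empty
  }
  def get: Map[String, String] = tl.get
  def set(m: Map[String, String]): Unit = tl.set(m)
}

// Installs `context` for the duration of `body`, restoring the old map even on failure.
def withMDC[A](context: Map[String, String])(body: => A): A = {
  val old = Mdc.get
  Mdc.set(context)
  try body finally Mdc.set(old)
}

val inside = withMDC(Map("requestId" -> "abc")) {
  // any library code logging here would see requestId=abc
  Mdc.get("requestId")
}
println(inside)          // abc
println(Mdc.get.isEmpty) // true -- the previous context was restored
```

With the real MDC, the body of withMDC would call MDC.setContextMap / MDC.clear instead, as in the setContextMap helper shown earlier in the thread.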
Does anyone know if this is still an issue on version 2.6.x when using Akka HTTP server backend?
I'm going to mark this as a defect because I think we need to come up with a solution to this kind of problem.
Link to discussion I started on this topic. https://contributors.scala-lang.org/t/locals-in-scala-concurrent/1785/4
Lightbend Telemetry claims to propagate MDC properly - has anyone used this?
https://developer.lightbend.com/docs/cinnamon/current/extensions/mdc.html
On second glance, Telemetry requires a license. Another possibility is Akka's DiagnosticLoggingAdapter:
https://doc.akka.io/docs/akka/2.5/logging.html#mdc-values-defined-by-the-application
There is only one way to make it work: aspects. I'm almost sure that's how Lightbend has implemented it. Generally, this is not as complex as it seems; the Kamon logback project could serve as an example, or maybe even as a solution: https://github.com/kamon-io/kamon-logback
Any update on this? I'm a Java developer, and finding a solution is a nightmare; I managed to translate many Scala workarounds, but nothing seems to work at runtime.
Is there any intention to address this in its current state?
On Scala, even when providing an ExecutionContext whose prepare copies the MDC across, I have seen the MDC lost in Filters and in Actions whenever the execution is picked up by a different thread.
It can be made to work reliably with the following changes. Is the undoing of the Accumulator Future optimisation the performance impact that is mentioned above?