
Coroutine scheduler monitoring

Open asad-awadia opened this issue 6 years ago • 25 comments

Are there any monitoring tools available for how many coroutines are currently active, their state, etc.? It would be nice if these numbers could be exposed so that something like Prometheus can scrape them and Grafana can visualise them.

It would also help in debugging leaks and errors: if we see the coroutine count just rising linearly, something is probably leaking.

If not, can this be done by looking at thread stats instead?

Go exposes it via runtime.NumGoroutines()

Related: https://github.com/Kotlin/kotlinx.coroutines/issues/494

asad-awadia avatar Jul 21 '19 00:07 asad-awadia

Please take a look at the kotlinx-coroutines-debug module: https://github.com/Kotlin/kotlinx.coroutines/blob/master/kotlinx-coroutines-debug/README.md
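For the original question (a Go-style count of live coroutines), the debug module can already provide the number, at a cost. A minimal sketch, assuming kotlinx-coroutines-debug is on the classpath and the overhead of DebugProbes is acceptable for your workload; the wiring to Prometheus/Micrometer is left to the application:

```kotlin
import kotlinx.coroutines.debug.DebugProbes

// Call once at startup. DebugProbes instruments coroutine creation and has a
// measurable cost, so enable it deliberately rather than unconditionally in production.
fun installCoroutineMonitoring() {
    DebugProbes.install()
}

// Poll this from your metrics exporter (e.g. a Prometheus/Micrometer gauge).
// Each call walks the set of live coroutines, so keep the polling interval modest.
fun activeCoroutineCount(): Int = DebugProbes.dumpCoroutinesInfo().size
```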

elizarov avatar Jul 22 '19 16:07 elizarov

This does look pretty useful, but it also seems like it might have a notable performance impact?

The monitoring that looks attractive to me would be getting a gauge on the sizes of the CoroutineScheduler queues (global and local).

Our biggest fear is accidentally putting slow blocking work (or, worse, deadlocks) in our main dispatcher. This happened to us once on a previous project that used Kotlin coroutines incorrectly, and again when using Ratpack's coroutine-style execution.

So getting alerted if work is building up over time (i.e., if the queues are getting too big or growing indefinitely) seems helpful.

Would it be reasonable to expose some of these stats somewhere? These stats are specific to the CoroutineScheduler so I don't think kotlinx-coroutines-debug is relevant.

As an awful hack we are considering parsing (Dispatchers.Default as ExecutorCoroutineDispatcher).executor.toString(), with full understanding that it may break at any time.
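For illustration, the hack could look roughly like the sketch below. Both the cast and the toString() format are internal implementation details (the exact labels have changed across versions), so this is strictly best-effort:

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.ExecutorCoroutineDispatcher

// Best-effort: extract the global CPU queue size from the default scheduler's
// toString(). Returns null if the cast fails or the label no longer matches.
fun defaultDispatcherGlobalQueueSize(): Int? {
    val dispatcher = Dispatchers.Default as? ExecutorCoroutineDispatcher ?: return null
    val description = dispatcher.executor.toString()
    return Regex("""global CPU queue size = (\d+)""")
        .find(description)
        ?.groupValues?.get(1)
        ?.toIntOrNull()
}
```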

glasser avatar Aug 14 '19 06:08 glasser

The monitoring that looks attractive to me would be getting a gauge on the sizes of the CoroutineScheduler queues (global and local).

@glasser Yes, that can be done without the slow debug mode and makes sense. I'll keep it open as an enhancement.

elizarov avatar Aug 14 '19 07:08 elizarov

Thanks! Should I interpret that as "you're going to do it" or "you'd accept patches"?

glasser avatar Aug 14 '19 08:08 glasser

Unfortunately, we are not ready to accept patches right now because the scheduler is being actively reworked.

But it would be really helpful if you could provide a more detailed example of the desired API shape and problem you want to solve with this API.

For example, "Ideally, we'd see it as pluggable SPI service for dispatcher with the following methods ..., so we could use to trigger our monitoring if ..."

qwwdfsad avatar Aug 14 '19 11:08 qwwdfsad

Interesting — is there a branch or design doc or something for the reworking? Curious how it's changing.

My proposal is pretty simple. A few of the core objects involved in coroutine scheduling should be (a) publicly accessible and (b) expose a few properties that provide statistics about them. It's fine if these are documented as "experimental, subject to change, don't rely on this" and as "fetching these properties may have a performance impact if done frequently" (e.g., ConcurrentLinkedQueue.size is O(n)).

Most specifically, I'd want to have access to

  • ExperimentalCoroutineDispatcher.coroutineScheduler (which perhaps would return an interface declared to only contain the metrics below)
  • LimitingDispatcher.queueSize: Int
  • CoroutineScheduler.corePoolSize: Int
  • CoroutineScheduler.maxPoolSize: Int
  • CoroutineScheduler.queueSizes: Map<WorkerState, List<Int>>
  • CoroutineScheduler.globalQueueSize: Int
  • CoroutineScheduler.schedulerName: String (for tagging in the unlikely case of multiple schedulers)

That's basically all the stuff in CoroutineScheduler.toString(); I think getting the control state isn't strictly necessary.

I don't need kotlinx.coroutines to provide any machinery for hooking this up to my metrics service: I'm happy to keep at application (or external library) level the code that takes the dispatchers I care about, polls them for metrics, and publishes to my metrics service of choice.
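In code, a sketch of that surface might look like the hypothetical interface below. The names mirror the list above; none of this is an existing kotlinx.coroutines API, and the WorkerState key from the proposal is replaced with a plain String since the real enum is internal:

```kotlin
// Hypothetical read-only view of scheduler statistics, exposed for monitoring only.
interface CoroutineSchedulerStats {
    val schedulerName: String
    val corePoolSize: Int
    val maxPoolSize: Int
    /** Local queue sizes, grouped by each worker's current state (CPU, blocking, parked, ...). */
    val queueSizes: Map<String, List<Int>>
    val globalQueueSize: Int
}
```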

glasser avatar Aug 14 '19 21:08 glasser

Interesting — is there a branch or design doc or something for the reworking? Curious how it's changing.

No to both, though the changes will, of course, be properly documented. Mostly it's about changing the parking/spinning strategy, without violating the liveness property, to reduce CPU consumption at low request rates and to get robust idle-thread termination. The change is just too intrusive and touches every part of the scheduler.

Thanks for the details! Could you please clarify: is this for an Android app or for a backend service? Asking because there is also a chance that Dispatchers.Default will be backed by ForkJoinPool on Android by default (mostly to reduce dex size and thread count), so we would have to make this observability interoperate with FJP as well.

qwwdfsad avatar Aug 20 '19 09:08 qwwdfsad

This is for server usage.

We are currently porting a few web servers from Ratpack to Ktor. Ratpack has an async structure similar to Kotlin coroutines (a recommended pool of "compute" threads roughly equal in size to the number of CPUs, plus a scaling "blocking" pool), but because you have to do all work with explicit Promise composition rather than the nice syntax of Kotlin coroutines, we've found that developers often don't bother to keep blocking work out of the compute pool, and often implement error handling incorrectly (e.g. by putting try/catch/finally or retry loops around functions that return Promises rather than properly using the Promise API). Our hope is that Kotlin coroutines will be much more accessible. But we still want to monitor that we're not clogging up the pools!

(Ratpack Promises also have some other odd behavior. For example, Blocking.get {}, which is somewhat like withContext(Dispatchers.IO) {}, does not actually invoke the given block on the scalable thread pool until the currently running code fully returns to the event loop (the equivalent of suspension). That meant that some misguided attempts to make a blocking call inside a non-Promise-returning function use the "right" thread pool, by writing (effectively) Blocking.get {}.get(), not only tied up the current thread like you might expect but actually blocked indefinitely, because the block never got run! Hopefully our complete rewrite will avoid these edge cases.)

glasser avatar Aug 20 '19 17:08 glasser

+1 to everything that @glasser said. Looking to start replacing some thread pools with coroutines in our high-volume, production, back-end service, and would feel a lot better about it if we had some way to emit metrics about the health of the pools/scheduler. Thanks!

cprice404 avatar Sep 23 '19 03:09 cprice404

I have an app that launches millions of CPU-bound coroutines, and they are taking longer than expected to complete. I am wondering whether they are slow because of the overhead of scheduling and executing them. I would like to have monitoring on the queue size for this reason.

lfmunoz avatar Jan 10 '20 19:01 lfmunoz

Any updates on this? Any news on when it may be implemented? We are also interested in monitoring the number of coroutines, and it is really disappointing that such a basic metric is not available by default.

damian-pacierpnik-jamf avatar Aug 26 '20 17:08 damian-pacierpnik-jamf

Any updates on this? Any other ways of getting similar numbers? Wanting metrics for basically the same reasons as @glasser. :)

anderssv avatar Sep 15 '20 19:09 anderssv

Any updates? I'm interested as well.

vikiselev avatar Nov 09 '20 09:11 vikiselev

Also interested in this

premnirmal avatar Apr 06 '21 09:04 premnirmal

We aim to implement it in the next releases after 1.5.0

qwwdfsad avatar Apr 06 '21 09:04 qwwdfsad

Our use case is also a high-load, server-side service. In addition to the metrics @glasser mentioned:

  • latency
  • completed tasks

joost-de-vries avatar Aug 10 '21 17:08 joost-de-vries

@qwwdfsad any updates? Also very much interested in this.

soudmaijer avatar Sep 08 '21 09:09 soudmaijer

@soudmaijer for us this is so critical that I implemented the 'awful hack' that glasser mentioned. See https://github.com/joost-de-vries/spring-reactor-coroutine-metrics/tree/coroutineDispatcherMetrics/src/main/kotlin/metrics
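For anyone rolling their own version, the general shape is just a polled gauge around whatever best-effort number can be extracted from the dispatcher. A sketch using Micrometer (not the linked project's actual code; the queueSize parameter stands in for something like the toString()-parsing hack above):

```kotlin
import io.micrometer.core.instrument.Gauge
import io.micrometer.core.instrument.MeterRegistry

// The gauge is sampled on each scrape, so the supplier must stay cheap;
// -1 is reported when the number cannot be extracted.
fun registerDispatcherQueueGauge(registry: MeterRegistry, queueSize: () -> Int?) {
    Gauge.builder("coroutines.default.global.queue.size") { queueSize() ?: -1 }
        .register(registry)
}
```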

joost-de-vries avatar Sep 08 '21 09:09 joost-de-vries

We aim to implement it in the next releases after 1.5.0

Does that mean that this will be addressed in 1.6.0 (which appears to be close to release)?

cprice404 avatar Nov 12 '21 13:11 cprice404

In IJ we have our own unlimited executor (let's call it ApplicationPool). We log a thread dump when the number of threads exceeds a certain value, but we don't prevent spawning new threads. I'd like to replace ApplicationPool with Dispatchers.IO.limitedParallelism(MAX_VALUE), but I'm missing the diagnostics part.

Using an effectively unlimited IO dispatcher would allow us to drop our own executor service (a single-pool-for-the-whole-app approach) and avoid the unnecessary thread switches that inevitably happen between Dispatchers.Default and ApplicationPool.asCoroutineDispatcher().
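Until the scheduler exposes anything directly, a rough version of that diagnostic can be built on JMX thread data. A sketch, assuming the scheduler's worker threads keep their current "DefaultDispatcher-worker-" name prefix (an implementation detail that may change):

```kotlin
import java.lang.management.ManagementFactory

private val threadMXBean = ManagementFactory.getThreadMXBean()

// Count live threads belonging to the coroutine scheduler by name prefix.
fun coroutineWorkerThreadCount(): Int =
    threadMXBean.dumpAllThreads(false, false)
        .count { it.threadName.startsWith("DefaultDispatcher-worker-") }

// Poll this periodically and log a full thread dump once the count exceeds a
// threshold, mirroring the watchdog described above for ApplicationPool.
fun checkWorkerThreadThreshold(threshold: Int) {
    if (coroutineWorkerThreadCount() > threshold) {
        val dump = threadMXBean.dumpAllThreads(true, true).joinToString("\n")
        System.err.println("Coroutine worker thread count exceeded $threshold:\n$dump")
    }
}
```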

dovchinnikov avatar Sep 07 '22 14:09 dovchinnikov

Is there any update on this issue?

jaredjstewart avatar Apr 06 '23 00:04 jaredjstewart

any update?

chenzhihui28 avatar Apr 12 '23 07:04 chenzhihui28

@joost-de-vries is your hack still working out reasonably well for you?

glasser avatar Apr 14 '23 19:04 glasser

Is there any update on this issue?

cleidiano avatar Apr 25 '24 01:04 cleidiano