bug A future created inside of a parallel collection map / foreach sometimes blocks the fork join pool with JDK 11

reproduction steps

(using Scala 2.12.12 or 2.13.3)

clone a project from https://github.com/OndrejSpanel/ParLoopAndFuture
run the Main class using JDK 11.0.4

problem

Observe the message "Suspicious background duration ..." being printed multiple times. This messages shows that a problem with a parallel foreach scheduling has been detected, namely that a long-duration future created from the parallel foreach has blocked the foreach executtion. This happens both with Scala 2.12.12 and Scala 2.13.3 running on JDK 11.0.4. It does not happen when running on JDK / JRE 8. It seems to happen intermittently, probably depending on exact timing of the Future call relative the to parallel collection tasks execution.

The foreach should normally terminate in much less than 200 ms, as 50 tasks are executed, each of them taking 2 ms. Once it takes more than 200 ms, it shows there is a scheduling problem with the slow future blocking the parallel collection tasks.

Following JFR screenshot shows the thread scheduling once the issue is seen (you can also see normals iterations without the issue):

My configuration is Windows 10 x64 build 2004, Intel i7 / 4x core.

The code demonstrating the issue is https://github.com/OndrejSpanel/ParLoopAndFuture/blob/c2f4d1b6b155217d8f8111f58e6a7c9337f36619/src/main/scala/com/github/ondrejspanel/parFuture/Main.scala#L1-L45

package com.github.ondrejspanel.parFuture

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object Main {


  def hotSleep(ms: Int): Unit = {
    val now = System.currentTimeMillis()
    while (System.currentTimeMillis() < now + ms) {}

  }

  def main(args: Array[String]): Unit = {
    val stateCount = 50
    val state = (0 until stateCount).par

    val triggerIssue = true

    (0 until 1000).foreach { i =>
      hotSleep(25)
      val innerScopeBeg = System.currentTimeMillis()
      if (!triggerIssue) {
        Future {
          hotSleep(205)
        }
      }
      state.foreach { x =>
        if (triggerIssue && x == 0) {
          Future {
            hotSleep(205)
          }
        }
        hotSleep(2)
      }
      val innerScopeEnd = System.currentTimeMillis()
      val duration = innerScopeEnd - innerScopeBeg
      if (duration >= 200) {
        println(s"Suspicious background duration $duration")
      }
    }
  }

}

Aug 04 '20 10:08 OndrejSpanel

does using blocking help? see #12089

cc @viktorklang

Aug 04 '20 15:08 SethTisue

@SethTisue blocking could help, but ultimately, if it manifests on 2.12 it is likely a FJP thing.

Aug 04 '20 15:08 viktorklang

@SethTisue Where should I use blocking? There is no blocking or sleeping anywhere in the code. The hotSleep function is just used to simulate a CPU intensive code, no real sleeping there.

Aug 04 '20 19:08 OndrejSpanel

it is likely a FJP thing

I think this is very likely, as the thread which is waiting is inside of the java.util.concurrent.ForkJoinTask#internalWait function.

Aug 04 '20 20:08 OndrejSpanel

@OndrejSpanel blocking means "blocking progress for other tasks"—whish seems to be what hotSleep does.

Aug 04 '20 21:08 viktorklang

I have tested it and it does not help, at least when used as this:

            Future {
              blocking {
                hotSleep(205)
              }
            }

Aug 05 '20 05:08 OndrejSpanel

blocking means "blocking progress for other tasks"

@viktorklang I find such definition unusual. Compare with https://en.wikipedia.org/wiki/Blocking_(computing):

A process that is blocked is one that is waiting for some event, such as a resource becoming available or the completion of an I/O operation.

Aug 05 '20 05:08 OndrejSpanel

@OndrejSpanel Remember that when you execute on a thread pool you have cooperative scheduling for the tasks, not preemptive, so in this case, no matter if you are busy-spinning or sleeping you are preventing progress for other tasks.

Aug 05 '20 10:08 viktorklang

"busy-spinning" is just a simple replacement of a CPU intensive computation found in the real application where I see the issue. I could use a recursive Fibonacci computation instead, or something like that, but hotSleep is much eaiser to implement and understand.

When I execute one long task on a pool of 8 threads, I do not expect it to be blocking the whole process. The same code works on Java 8 without any issues and I do not see any reason why it should not.

There is some problem with Java 11 scheduling (most likely caused by some FJP implementation changes), which perhaps causes some short task to be scheduled on the thread which is busy with the long task, or something like that. Spawning more threads (which is the result of using blocking) would not help.

By posting the issue here I expect to achieve two things:

the issue to be acknowledged and documented
some reasonable workaround to be found

Fixing the issue either in Scala or in Java would be nice, but I do not expect for this to happen soon enough to help me in any way.

Aug 05 '20 11:08 OndrejSpanel

Putting the blocking discussion aside, it would be good to keep investigating.

Since the same thread pool is used, maybe the surface area of the reproduction can be reduced? Avoid using parallel collections, or maybe even write a java-only reproducer? @OndrejSpanel do you have time to give this a try? Or even in the current state, it might be worth reporting it to Java concurrency experts. http://altair.cs.oswego.edu/mailman/listinfo/concurrency-interest for example be active.

For the record, I can reproduce the issue as described in the ticket on my macOS.

Aug 12 '20 15:08 lrytz

Avoid using parallel collections

I did not see the issue when not using parallel collections. It is possible it could be reproduced by expanding the parallel collections to the plain Java executor calls, I am not sure how much work this would be.

@OndrejSpanel do you have time to give this a try?

I am not sure I want to dedicate much time to this. Given the issue shows with Java 11 which is available to general public, I guess a priority for me will be finding a suitable workaround, as I cannot expect the users of my application will avoid this particular runtime.

Aug 12 '20 15:08 OndrejSpanel

Understood, thanks.

I also looked at it in JMC. Here's the stack trace of the waiting executor thread:

void java.lang.Object.wait(long)	1	100 %
void java.util.concurrent.ForkJoinTask.internalWait(long)	1	100 %
int java.util.concurrent.ForkJoinPool.awaitJoin(ForkJoinPool$WorkQueue, ForkJoinTask, long)	1	100 %
int java.util.concurrent.ForkJoinTask.doJoin()	1	100 %
Object java.util.concurrent.ForkJoinTask.join()	1	100 %
void scala.collection.parallel.ForkJoinTasks$WrappedTask.sync()	1	100 %
void scala.collection.parallel.ForkJoinTasks$WrappedTask.sync$(ForkJoinTasks$WrappedTask)	1	100 %
void scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync()	1	100 %
void scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask.internal()	1	100 %
...

So the wait starts here: https://github.com/scala/scala/blob/v2.12.12/src/library/scala/collection/parallel/Tasks.scala#L174

It seems that the last subtask within the parallel collections task happens to be the spawned future, if I understand correctly?

I'll try to dig a little further in the next few days.

Aug 12 '20 19:08 lrytz

the last subtask within the parallel collections task happens to be the spawned future

I am not sure I understand you correctly. If you check the source code of the repro, you will see the first iteration of the parallel collection spawns a future, but this does not mean the future becomes the subtask of the collection - the subtasks finishes normally without any further interaction with the spawned future. I repeat the corresponding source lines:

      state.foreach { x =>
        if (triggerIssue && x == 0) {
          Future {
            hotSleep(205)
          }
        }
        hotSleep(2)
      }

Aug 12 '20 20:08 OndrejSpanel

Yes, the future task should be independent of the parallel collections task, but what seems to happen is that the parallel collections task waits for the future to complete at the code location I pointed to. I'll try to figure out if that's really it, and why.

Aug 13 '20 06:08 lrytz

After looking at the implementation of parallel collections and futures for a while, I managed to isolate this issue entierly from Scala and wrote a reproducer using only Java and its ForkJoinPool.

https://gist.github.com/lrytz/a622c541d91336030a8cb788273bdc66

When running on Java 11 or later, this code regularaly prints Suspicious durations, i tested teh following JDKs:

AdoptOpenJDK (build 11.0.6+10)
AdoptOpenJDK (build 14.0.1+7)
OpenJDK Runtime Environment (build 15+36-1562) (from https://jdk.java.net/15/)

Before taking this outside, people with concurrency experience that are watching this repo (e.g., @viktorklang, @NthPortal, @som-snytt, @retronym, ...) could maybe take a look at the Java source and share what they think.

Is it a bug in the JDK?
If not, is there something we can / should do from our side in parallel collections and/or futures?
If not, is there a recommendation for users to avoid this issue?

Sep 30 '20 19:09 lrytz

@lrytz As an aside, you'll want to use System.nanoTime (since currentTimeMillis may have some very "interesting" behaviors)

Sep 30 '20 20:09 viktorklang

@lrytz After having a quick look at the reproducer, I think it might be worth raising the issue on concurrency-interest.

Sep 30 '20 20:09 viktorklang

Thank you, Viktor.

Oct 01 '20 07:10 lrytz

I ran the Java example with the FJP parallelism set to 4 (that triggers the issue a lot more often than higher parallelism levels) and took 10 seconds flight recording on Java 8 and 11. There's an obvious difference: on Java 8 there are a lot of FJP threads (22):

On Java 11 there are only 5 threads:

I'll write a post on concurrency-interest.

Oct 01 '20 08:10 lrytz

If not, is there something we can / should do from our side in parallel collections and/or futures?

@lrytz I've put it on Backlog since it isn't clear there's something to do on our end, but you could milestone it for a release if you like

Oct 12 '20 16:10 SethTisue

Can something be done with this issue? I have recently tested new Java versions against the code provided by @lrytz in https://gist.github.com/lrytz/a622c541d91336030a8cb788273bdc66 and it seems to me it is happening even more frequently in newer versions. When performing 300 iterations and counting suspicious durations, I get results like this:

Java 8: Suspicious duration encountered 0 times Java 11: Suspicious duration encountered 14 times Java 17: Suspicious duration encountered 38 times Java 19: Suspicious duration encountered 37 times Java 21: Suspicious duration encountered 37 times

(All Oracle JDK, I have also tested Temurin 21 with the same result)

The fact the short tasks on fork-join pool may be randomly blocked by a longer task sounds like quite a fundamental issue and it is hard to understand for me why it does not seem important.

Regarding the questions in https://github.com/scala/bug/issues/12106#issuecomment-701590262

Is it a bug in the JDK?

If not, is there something we can / should do from our side in parallel collections and/or futures?

If not, is there a recommendation for users to avoid this issue?

I am convinced it is a bug in the JDK, but I am unsure what could I do to get some attention to it.

I'll write a post on concurrency-interest.

I do not see any link here - was there any follow up?

Dec 12 '23 10:12 OndrejSpanel

@scala/collections

Dec 12 '23 18:12 SethTisue

We need a scala/concurrent group, @SethTisue

I took a look at the original sample and updated for 2.13. I don't see a problem. On my machine, 8 available processors, if it happens that 4 long-runners are running, then the remaining 4 threads are not enough to beat the timer on the wall clock. This does not seem mind-bending to me. Maybe the overhead is surprising. The OP says 50 tasks at 2 ms = 100 ms, it should never take twice that by wall clock.

I tried to refresh my memory about scala.concurrent internals, but actually I never learned about the complex machinery.

I did just notice that I had backported to parallel collections that the optimization to fork from a fork join worker only applies if you're on the pool, but that code no longer exists in scala.concurrent, so that is really ancient history. Currently, future apply always goes through the front door of execute. (The doc says, "unless using async event-style tasks that are not usually joined, in which case there is little difference" between fork and execute.)

The other footnote is that the default pool has async mode on, so jobs are run in order; normally, you would fork and run that computation right away (to join and then continue with whatever algorithm). I think for par operations, I would not use global pool.

Sample debug output, showing that when loop 10 takes too long, it's because 4 threads are working hard and the other threads are running the futures from previous loops, F2 is when they complete:

INNER 10:22 scala-execution-context-global-131
INNER 10:47 scala-execution-context-global-197
INNER 10:48 scala-execution-context-global-166
INNER 10:36 scala-execution-context-global-233
INNER 10:23 scala-execution-context-global-131
INNER 10:49 scala-execution-context-global-166
INNER 10:24 scala-execution-context-global-131
F2 7:0 scala-execution-context-global-232
F2 8:0 scala-execution-context-global-133
F2 9:0 scala-execution-context-global-164
F2 10:0 scala-execution-context-global-129
Suspicious background duration 221 sbt-bg-threads-7

No "inner" tasks run on the same worker thread as a "future". It occurs to me that (without consulting the code) maybe tasks do wind up on the same worker as the long-runner, until they are taken by a different worker; that could result in added delays and overhead.

Perhaps Lukas's example in Java shows the same thing, namely, the fork/join takes longer because threads are occupied by the long-running computations, plus scheduling overhead.

At the end of a run, there will be a thread or two draining their remaining task or two, as opposed to four threads completing the last four tasks in parallel. (I didn't refresh my memory about how FJP actually works.)

I checked JDK 8, which seems to do better with the degenerate workload, with greater throughput on fewer threads. Here is a future about to squat on thread 153 (with time at left):

13172480 INNER2 407:29 scala-execution-context-global-148
13172480 INNER 407:30 scala-execution-context-global-148
13172480 INNER2 407:10 scala-execution-context-global-153
13172480 INNER 407:11 scala-execution-context-global-153
13172481 INNER2 407:39 scala-execution-context-global-151
13172481 INNER 407:40 scala-execution-context-global-151
13172482 INNER2 407:5 scala-execution-context-global-152
13172482 INNER2 407:17 scala-execution-context-global-149
13172482 INNER 407:18 scala-execution-context-global-149
13172482 INNER2 407:30 scala-execution-context-global-148
13172482 INNER 407:31 scala-execution-context-global-148
13172482 INNER2 407:11 scala-execution-context-global-153
13172483 F1 407:0 scala-execution-context-global-153
13172483 INNER2 407:40 scala-execution-context-global-151
13172483 INNER 407:41 scala-execution-context-global-151

INNER2 is after the busy wait, so as each thread finishes a task, it churns through the next one yolo style. Maybe there's doc for scheduling improvements that explains why the newer behavior is better.

Dec 12 '23 19:12 som-snytt

if it happens that 4 long-runners are running

The parallel collection should be always able to execute its tasks on its own (main) thread. The calling thread can never be used for long-runner, which are run as futures, using EC. In the JCM capture in the original post it can be seen the main thread is blocked, waiting for worker threads to complete.

Dec 13 '23 00:12 OndrejSpanel

Does that show 2.12 behavior? I didn't experiment with 2.12, and I'm not sure if the scala.concurrent changes for 2.13 could make a difference?

I assume your main is joining the tasks, which must be the two long bars which are not all green. I don't know (much, or enough) about the internals of FJP to guess why those threads seem to be stuck. I could imagine that is a FJP bug. I'll also experiment on JVM 11/17. Also I haven't checked for bug reports, and quickly abandoned looking for a release note.

On 2.13/JDK 21, pool threads are running the foreach tasks in parallel with the futures, which complete later. But maybe I just haven't hit this even-more-pathological case yet.

Dec 13 '23 02:12 som-snytt

I must be wrong because here it waits for futures 0-2 to complete before reporting after the foreach. This is JDK 11.

(I removed a sleep I'd added to confuse myself.)

34497623 INNER2 2:21 scala-execution-context-global-101
34497623 INNER 2:22 scala-execution-context-global-101
34497625 INNER2 2:22 scala-execution-context-global-101
34497626 INNER 2:23 scala-execution-context-global-101
34497628 INNER2 2:23 scala-execution-context-global-101
34497628 INNER 2:24 scala-execution-context-global-101
34497630 INNER2 2:24 scala-execution-context-global-101
34497713 F2 0:0 scala-execution-context-global-96
34497761 F2 1:0 scala-execution-context-global-100
34497825 F2 2:0 scala-execution-context-global-97
34497825 Suspicious background duration (238) at 2 sbt-bg-threads-1

Dec 13 '23 02:12 som-snytt

I do not see any link here - was there any follow up?

I cannot seem to find an online archive of concurrency-interest, so here's what I have from my email. I got one answer. https://gist.github.com/lrytz/a979bf2b410fc9058be317189856671c

Dec 13 '23 07:12 lrytz

So it's just trying to be helpful by assuming you used the wrong API for a subtask. This is like the optimization I mentioned for Future. I never quite understood how Future was intended to interact meaningfully with the API, so at least forking under the hood was a gesture.

Dec 13 '23 15:12 som-snytt

@som-snytt I have re-read your last comment several times and I am afraid it structure is too complex for me to understand what it means. Can you rewrite it in simpler words, please?

a priority for me will be finding a suitable workaround

Back then I have implemented a hacky workaround of an execution context which is not starting and futures while parallel loop is running. Now I am revisiting this and checking if perhaps I would be able to implement a custom Execution Context and Task Support to have a simpler and more robust solution.

I have tried using Executors.newFixedThreadPool, and while it did not show the issue, the general performance was real bad, there was a lot of thread parking between the individual tasks.

As another basic experiment I have tried what @lihaoyi wrote in https://github.com/scala/toolkit/issues/31#issuecomment-1722466461 - and voila, it does not show the issue even with Java 11 - 21 and the default Fork Join Pool execution context. See https://github.com/OndrejSpanel/ParLoopAndFuture/tree/use-futures

I guess this perhaps answer the part of the question "why is nobody interested in the issue". It does not show without parallel collections and maybe nobody is using parallel collections for any serious work?

Dec 15 '23 07:12 OndrejSpanel

@OndrejSpanel I was trying to say that Future doesn't let you do fork/join, and its default pool is "async" (event-driven), so it really caters to a certain kind of program.

Future used to have a line of code to say that if you try to run something on the "current" fork/join pool, it will try to fork instead of execute (submit from "outside" client). That is similar to what the pool is doing with the Future in your example. But I haven't looked yet at the internals. Parallel collections still has that behavior.

Thanks for the update, I'll try out your branch.

I don't know the answer to your last question, but the lesson from the issue is clearly not to mix future and par. By "clearly", I mean, "I assume so."

Dec 15 '23 07:12 som-snytt

bug bug copied to clipboard

A future created inside of a parallel collection map / foreach sometimes blocks the fork join pool with JDK 11

reproduction steps

problem

bug
bug copied to clipboard