The JVM images should be container-friendly
When running Flink jobs, or anything else, in a container hosting Java 1.8 (or any release prior to Java 10), we must ensure that the following options are enabled by default:
- -XX:+UnlockExperimentalVMOptions
- -XX:+UseCGroupMemoryLimitForHeap
Failing to do so will cause the JVM to size its heap to 25% of the host's available memory rather than respecting the container's limit. In the case of a Flink job, where the task manager's host machine has more than 4GiB of memory available, the Linux OOM killer will likely terminate the container at some stage. I inspected the java command line for a Flink task and did not see that the above options had been applied.
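To make the difference concrete, a quick sketch (the openjdk:8u131-jre image tag and the 512MiB limit are only illustrative; any JDK 8 build from 8u131 up to, but not including, 8u191 behaves this way):

# Without the flags the JVM sizes its heap from the host's RAM,
# ignoring the 512MiB cgroup limit:
% docker run --rm -m 512m openjdk:8u131-jre java -XshowSettings:vm -version

# With the flags the heap is sized against the cgroup memory limit
# (roughly a quarter of 512MiB by default):
% docker run --rm -m 512m openjdk:8u131-jre \
    java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap \
    -XshowSettings:vm -version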
I agree that those flags should be set by default. Thanks for pointing this out, @huntc!
@RayRoestenburg just prompted me to look a bit harder... when I checked whether -XX:+UseContainerSupport was a valid option (introduced in Java 10 and backported to 8u191), I did so on macOS. The JVM build for macOS reports the option as invalid, whereas on Linux it is valid:
% docker run -it adoptopenjdk/openjdk8:jdk8u272-b10-alpine sh
/ # java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal
[Global flags]
...
     bool UseContainerSupport                      = true                                {product}
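To double-check that the support actually kicks in, a sketch like this (the 512MiB limit is arbitrary) should report a Max. Heap Size (Estimated) of roughly a quarter of the container limit, i.e. the JVM's default MaxRAMPercentage=25 applied to 512MiB rather than to host RAM:

% docker run --rm -m 512m adoptopenjdk/openjdk8:jdk8u272-b10-alpine \
    java -XshowSettings:vm -version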
In any case, it's not a bad idea to verify that our default settings are helpful. @andreaTP: for Akka we set -XX:MaxRAMPercentage=50.0 -Djdk.nio.maxCachedBufferSize=1048576 by default for the Akka runner JVM (used in Akka streamlet pods), via the operator's application.conf, but we don't for Spark / Flink, with the intent that Spark / Flink would manage some of these details differently (Spark uses a memory overhead setting, for instance). It's been a while since we touched this, so it can't hurt to revisit.
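As a quick sanity check of that Akka runner default, a sketch along these lines (the 1GiB limit is arbitrary, and the image tag is reused from the transcript above) should report an estimated max heap of roughly half the container limit:

% docker run --rm -m 1g adoptopenjdk/openjdk8:jdk8u272-b10-alpine \
    java -XX:MaxRAMPercentage=50.0 -XshowSettings:vm -version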