flink-on-k8s-operator

Failed to submit JobGraph, and the exception detail is not enough to identify the cause

Open jiamo opened this issue 3 years ago • 1 comment

Using the latest master build, I created the example session cluster and job cluster with the image flink:1.12.1-scala_2.12-java11.

In the test Docker environment, I ran:

/opt/flink/bin/flink run -m flinksessioncluster-sample-jobmanager:8081 /opt/flink/examples/myfault-1.0-SNAPSHOT.jar
2021-02-04 02:31:03,798 INFO  org.apache.flink.client.cli.CliFrontend                      [] - --------------------------------------------------------------------------------
2021-02-04 02:31:03,801 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Starting Command Line Client (Version: 1.12.1, Scala: 2.12, Rev:dc404e2, Date:2021-01-09T14:46:36+01:00)
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  OS current user: root
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Current Hadoop/Kerberos user: <no hadoop dependency found>
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 11/11.0.10+9
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Maximum heap size: 709 MiBytes
2021-02-04 02:31:03,802 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  JAVA_HOME: /usr/local/openjdk-11
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  No Hadoop Dependency available
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  JVM Options:
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -Dlog.file=/opt/flink/log/flink--client-myfault-run-cn9xv.log
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -Dlog4j.configuration=file:/opt/flink/conf/log4j-cli.properties
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-cli.properties
2021-02-04 02:31:03,803 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -Dlogback.configurationFile=file:/opt/flink/conf/logback.xml
2021-02-04 02:31:03,804 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Program Arguments:
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     run
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     -m
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     flinksessioncluster-sample-jobmanager:8081
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -     /opt/flink/examples/myfault-1.0-SNAPSHOT.jar
2021-02-04 02:31:03,806 INFO  org.apache.flink.client.cli.CliFrontend                      [] -  Classpath: /opt/flink/lib/flink-csv-1.12.1.jar:/opt/flink/lib/flink-json-1.12.1.jar:/opt/flink/lib/flink-shaded-zookeeper-3.4.14.jar:/opt/flink/lib/flink-table-blink_2.12-1.12.1.jar:/opt/flink/lib/flink-table_2.12-1.12.1.jar:/opt/flink/lib/log4j-1.2-api-2.12.1.jar:/opt/flink/lib/log4j-api-2.12.1.jar:/opt/flink/lib/log4j-core-2.12.1.jar:/opt/flink/lib/log4j-slf4j-impl-2.12.1.jar:/opt/flink/lib/flink-dist_2.12-1.12.1.jar:::
2021-02-04 02:31:03,807 INFO  org.apache.flink.client.cli.CliFrontend                      [] - --------------------------------------------------------------------------------
2021-02-04 02:31:03,811 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.rpc.address, myfault-run-cn9xv
2021-02-04 02:31:03,812 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.rpc.port, 6123
2021-02-04 02:31:03,812 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.memory.process.size, 1600m
2021-02-04 02:31:03,812 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: taskmanager.memory.process.size, 1728m
2021-02-04 02:31:03,812 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2021-02-04 02:31:03,813 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: parallelism.default, 1
2021-02-04 02:31:03,813 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.execution.failover-strategy, region
2021-02-04 02:31:03,814 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: blob.server.port, 6124
2021-02-04 02:31:03,814 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: query.server.port, 6125
2021-02-04 02:31:03,848 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Loading FallbackYarnSessionCli
2021-02-04 02:31:03,945 INFO  org.apache.flink.core.fs.FileSystem                          [] - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2021-02-04 02:31:04,068 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory [] - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2021-02-04 02:31:04,082 INFO  org.apache.flink.runtime.security.modules.JaasModule         [] - Jaas file will be created as /tmp/jaas-5146463234971937258.conf.
2021-02-04 02:31:04,093 INFO  org.apache.flink.runtime.security.contexts.HadoopSecurityContextFactory [] - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2021-02-04 02:31:04,095 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Running 'run' command.
2021-02-04 02:31:04,230 INFO  org.apache.flink.client.cli.CliFrontend                      [] - Building program from JAR file
2021-02-04 02:31:04,325 INFO  org.apache.flink.client.ClientUtils                          [] - Starting program (detached: false)
2021-02-04 02:31:16,070 WARN  org.apache.flink.util.ExecutorUtils                          [] - ExecutorService did not terminate in time. Shutting it down now.
2021-02-04 02:31:16,074 ERROR org.apache.flink.client.cli.CliFrontend                      [] - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute job 'Fraud Detection'.
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:360) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:213) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:816) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:248) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1058) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1136) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1136) [flink-dist_2.12-1.12.1.jar:1.12.1]
Caused by: org.apache.flink.util.FlinkException: Failed to execute job 'Fraud Detection'.
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1918) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:135) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1782) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.allstoalls.FraudDetectionJob.main(FraudDetectionJob.java:48) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
	at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
	at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:343) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	... 8 more
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
	at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:400) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:364) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture.postFire(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture$Completion.run(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal server error: Java heap space]
	at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:486) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:466) ~[flink-dist_2.12-1.12.1.jar:1.12.1]
	at java.util.concurrent.CompletableFuture$UniCompose.tryFire(Unknown Source) ~[?:?]
	at java.util.concurrent.CompletableFuture$Completion.run(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
	at java.lang.Thread.run(Unknown Source) ~[?:?]

In the same Docker environment, this command:

/opt/flink/bin/flink run -m flinksessioncluster-sample-jobmanager:8081 /opt/flink/examples/batch/WordCount.jar --input /opt/flink/README.txt

works fine.

So what is the real cause of [Internal server error: Java heap space]? The same jar works fine on a local Flink cluster.

Is there a way to debug this?

jiamo, Feb 04 '21 01:02

Figured it out: the default heap size (jobmanager.memory.heap.size: 25165824b) is too small.
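To put that number in perspective, here is a quick conversion of the reported value from bytes to MiB. This is only arithmetic on the figure quoted above; the variable names are made up for illustration.

```python
# Convert the reported JobManager heap size from bytes to MiB.
reported_heap_bytes = 25165824  # value quoted in the comment above
MIB = 1024 * 1024               # bytes per MiB

reported_heap_mib = reported_heap_bytes / MIB
print(f"reported heap: {reported_heap_mib:.0f} MiB")  # prints "reported heap: 24 MiB"
```

A 24 MiB heap leaves very little room for the JobManager to receive and deserialize a submitted job, which would be consistent with the "Java heap space" error on submission.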

Using this config:

  flinkProperties:
    taskmanager.numberOfTaskSlots: "1"
    jobmanager.heap.size: ""               # set empty value (only for Flink version 1.11 or above)
    jobmanager.memory.heap.size: 150mb
    jobmanager.memory.process.size: 1gb    # job manager memory limit (only for Flink version 1.11 or above)
    taskmanager.heap.size: ""              # set empty value
    taskmanager.memory.process.size: 1gb   # task manager memory limit
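As a rough sanity check on values like these, the sketch below converts the size strings to bytes and confirms that the explicit heap fits inside the process size. `parse_size` is a hypothetical helper written for this example, not Flink's own parser, and it assumes the suffixes are binary multiples (1 kb = 1024 bytes).

```python
import re

# Binary multipliers (assumption: 1 kb = 1024 bytes, 1 mb = 1024 kb, and so on).
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}


def parse_size(text: str) -> int:
    """Hypothetical helper: parse strings such as '150mb' or '1gb' into bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]+)\s*", text.lower())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


heap = parse_size("150mb")    # jobmanager.memory.heap.size
process = parse_size("1gb")   # jobmanager.memory.process.size

# The heap is only one component of the total process memory, so it must be smaller.
assert heap < process
print(heap, process)
```

The check only compares the two values from the config above; it does not model the other memory components that Flink carves out of the process size.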

The job can be submitted now. However, the error message does not point to the specific problem.

jiamo, Feb 04 '21 06:02