User did not initialize spark context!
**Used Spark version:** 2.4.0
**Used Spark Job Server version:** v0.11.0
**Deployed mode:** YARN cluster mode

**Actual (wrong) behavior:**
Spark context creation fails with the following error:

```
[2021-06-09 17:17:56,338] INFO loy.yarn.ApplicationMaster [] - Final app status: FAILED, exitCode: 13
[2021-06-09 17:17:56,348] ERROR loy.yarn.ApplicationMaster [] - Uncaught exception: java.lang.IllegalStateException: User did not initialize spark context!
	at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:465)
	at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:276)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:821)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:820)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
	at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:820)
```

**Steps to reproduce:**
- Start SJS using server_start.sh
- Try to start a Spark context using the following request: `http://XXX:8090/contexts/spark_context?context-factory=spark.jobserver.context.SessionContextFactory&`

**Logs:**
- Conf file: env.conf.txt
- Spark Job Server log: spark-job-server.log.txt
- YARN context application log: spark-job-server.out.txt
It was working fine on SJS 0.9.
Do you write `master("local[*]")` or `setMaster("local[*]")` in your code?
I have the same problem. Have you solved it?
I was not able to resolve the issue. I am waiting for someone to confirm that Yarn cluster mode works with SJS 0.10 or above.
I tried SJS 0.11.1: I couldn't run YARN cluster mode, but YARN client mode worked.
> Do you write `master("local[*]")` or `setMaster("local[*]")` in your code?
No coding, just create a context with API.
e.g.
```shell
curl -d "" "localhost:8090/contexts/ctx-example?context-factory=spark.jobserver.context.SessionContextFactory&spark.executor.instances=2&spark.executor.cores=1&spark.executor.memory=1g&spark.driver.memory=1g"
```
@vglagoleva hello, has YARN cluster mode been fully tested on Spark 3.x (SJS 0.11.1)? If not, what's the plan?
@jimolonely SJS 0.11.1 does not support Spark 3 at all. There is an open pull request, which has not yet been reviewed by anyone.
Spark 2.4.2 + SJS 0.11.1 + YARN cluster mode: same error. Has Spark 2.x been tested?
Spark 2.4.2 is supported by SJS 0.11.1.
Regarding YARN: we never had any specific tests for YARN, because Jobserver has no special logic for it. In the end, Jobserver just uses the spark-submit command.
Nevertheless, if you run Jobserver in cluster mode, please make sure that the Jobserver binary is uploaded to a distributed store and that you are not using the default in-memory H2 backend.
It is very important that the backend is set up correctly and that your MANAGER_JAR_FILE variable points to the path of the file in HDFS/PostgreSQL/.. and not to a local path.
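For illustration, a minimal sketch of what that might look like in a deployment script. The HDFS paths and jar file name here are hypothetical examples, not values taken from this thread:

```shell
# Hypothetical deployment sketch for YARN cluster mode.
# First, upload the Jobserver binary to a location every node can reach, e.g.:
#   hdfs dfs -mkdir -p /jobserver && hdfs dfs -put spark-job-server.jar /jobserver/
# Then point MANAGER_JAR_FILE at the distributed copy, NOT at a local path:
MANAGER_JAR_FILE="hdfs:///jobserver/spark-job-server.jar"
echo "$MANAGER_JAR_FILE"
```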
Another thing to check is that you use the correct Scala version. By default, the current Jobserver master branch compiles for Scala 2.12. You may need to use `export SCALA_VERSION=2.11.8`. A mismatch of Scala versions may also cause unexpected errors.
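A quick way to check for such a mismatch (assuming `spark-submit` is on the PATH; the grep filter is just a convenience, and 2.11.8 is only an example for a Spark 2.4.x build):

```shell
# Print the Scala version your Spark build was compiled against, if Spark is installed.
if command -v spark-submit >/dev/null 2>&1; then
  spark-submit --version 2>&1 | grep -i 'scala version'
fi
# Then build Jobserver against the matching Scala line before starting it:
export SCALA_VERSION=2.11.8
echo "$SCALA_VERSION"
```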
I am not a YARN user myself, so I can't help you more.
Hi @vglagoleva, agreed. I suspect the akka-actor upgrade (SJS 0.9.0 was using the Spray server). What I have noticed is that the Akka actors are not able to communicate between the cluster node and the job server node (the Akka master).
Thanks @vglagoleva , Issue has been resolved and now we are able to initialize the context. Thanks for the quick response 🙂
How to solve it?
I found several warnings. Maybe this is the reason?
```
21/07/02 16:08:57 INFO JobManagerActor: Starting actor spark.jobserver.JobManagerActor
21/07/02 16:08:57 INFO ProductionReaper: Starting actor spark.jobserver.common.akka.actor.ProductionReaper
21/07/02 16:08:57 WARN JobDAOActor: Shutting down spark.jobserver.io.JobDAOActor
21/07/02 16:08:57 WARN ProductionReaper: Shutting down spark.jobserver.common.akka.actor.ProductionReaper
```
Hi @venkatkrishna110, can you please share the env conf and Spark properties you are using for YARN cluster mode?
Hi @vglagoleva @pgouda89, I have tentatively traced the failure to the ProductionReaper and JobDAOActor shutting down, but it is still unclear why they shut down.
Hi @hujian0923, as @pgouda89 mentioned, if you share your configuration files, maybe we can find some clues there. It's hard to say anything otherwise.
Configuration information:
```hocon
spark {
  master = "yarn"
  submit.deployMode = "cluster"
  job-number-cpus = 4
  jobserver {
    port = 8090
    context-per-jvm = true
    context-creation-timeout = 1000000 s
    yarn-context-creation-timeout = 1000000 s
    default-sync-timeout = 1000000 s
    short-timeout = 60 s
    max-jobs-per-context = 80
    jobdao = spark.jobserver.io.JobSqlDAO
    filedao {
      rootdir = /tmp/spark-jobserver/filedao/data
    }
    datadao {
      rootdir = /tmp/spark-jobserver/upload
    }
    sqldao {
      slick-driver = slick.jdbc.MySQLProfile
      jdbc-driver = com.mysql.jdbc.Driver
      rootdir = /tmp/spark-jobserver/sqldao/data
      jdbc {
        url = "jdbc:mysql://hadoop01:8100/jobserver?serverTimezone=Asia/Shanghai"
        user = "jobserver"
        password = "jobserver"
      }
      dbcp {
        enabled = false
        maxactive = 20
        maxidle = 10
        initialsize = 10
      }
    }
    result-chunk-size = 1m
  }
  context-settings {
    context-factory = "spark.jobserver.context.SessionContextFactory"
    num-cpu-cores = 1
    memory-per-node = 1G
    forked-jvm-init-timeout = 300 s
    context-init-timeout = 1000000 s
    passthrough {
      #es.nodes = "192.1.1.1"
    }
  }
}

akka.http.server {
  idle-timeout = 1200 s
  request-timeout = 1000 s
  parsing.max-content-length = 300m
}

flyway.locations = "db/mysql/migration"

akka {
  remote.netty.tcp {
    maximum-frame-size = 5120 MiB
    hostname = "hadoop01"
  }
}
```
I have the same problem.
> Thanks @vglagoleva , Issue has been resolved and now we are able to initialize the context. Thanks for the quick response 🙂
Could you share some ideas for the solution? Thanks @venkatkrishna110