
Unable to create pyspark's SparkSession object by embedding Python code in a Scala application using Jep

bhupesh-simpledatalabs opened this issue · 3 comments

I am trying to run sample Python code inside Scala code using Jep. In my Python code I simply create a SparkSession object via "SparkSession.builder.appName('name').master('local[1]').getOrCreate()" and execute that Python code via Jep using a SubInterpreter. I have also added pyspark as a shared module in the JepConfig used to create the SubInterpreter instance. My entire Scala code looks like this:

val jepConfig = new JepConfig
jepConfig.addSharedModules("pyspark")
val interpreter = new SubInterpreter(jepConfig)
interpreter.exec(
      """import site
        |import pyspark
        |from pyspark.sql import *
        |from pyspark.sql.functions import *
        |import threading
        |spark = SparkSession.builder.appName('name').master('local[1]').getOrCreate()
        |sc = spark.sparkContext  # sparkContext is a property, not a method
        |contextString = sc.getConf().toDebugString()
        |""".stripMargin)
print(interpreter.getValue("contextString"))

I am passing the necessary environment variables, listed below, and jep.jar is on the classpath; a quick sanity check of what the embedded interpreter actually sees is sketched after the list.

-Djava.library.path=/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/jep/
PYSPARK_PYTHON=/usr/local/bin/python3
SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1
PYTHONPATH=/usr/local/Cellar/apache-spark/3.0.1/libexec/python:/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip
PATH=/usr/local/Cellar/apache-spark/3.0.1/bin:/Library/Frameworks/Python.framework/Versions/3.9/bin:/usr/local/opt/python@3.9/bin:/Users/bhupeshgoel/Documents/apache-maven-3.6.3/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
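
A quick way to confirm that the embedded interpreter actually picks up these PYTHONPATH entries is to dump sys.path from the same interpreter setup before importing pyspark. A minimal sketch (reusing the jepConfig from above; names are illustrative):

val check = new SubInterpreter(jepConfig)
try {
  // Join sys.path into a single string so it can be pulled back into Scala.
  check.exec("import sys\npathString = '\\n'.join(sys.path)")
  println(check.getValue("pathString"))
} finally {
  check.close()
}

Both Spark directories from the PYTHONPATH above should show up in the output; if they don't, the pyspark import in the code above cannot succeed.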

But still, when I run the above Scala code I get a segmentation fault:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000012d633bc9, pid=92789, tid=0x0000000000001603
#
# JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 1.8.0_275-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [Python+0x7bbc9]  PyModule_GetState+0x9
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/bhupeshgoel/Documents/codebase/prophecy/hs_err_pid92789.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/AdoptOpenJDK/openjdk-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

The detailed error report file is also attached to this ticket.

I wanted to know whether pyspark is supported with Jep, especially when running pyspark code inside Scala/Java code. I was able to create a SparkSession instance in a Jep interactive session:

Bhupeshs-MacBook-Pro:~ bhupeshgoel$ jep
>>> import pyspark
>>> from pyspark.sql import *
>>> from pyspark.sql.functions import *
>>> spark = SparkSession.builder.appName('name').master("local[1]").getOrCreate()
21/04/28 13:03:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> lit(1)
Column<b'1'>

Other environment details:

  • OS Platform, Distribution, and Version: MacOS Catalina v10.15.7
  • Python Distribution and Version: python3.9
  • Java Distribution and Version: OpenJDK 1.8
  • Jep Version: 3.9.1
  • Python packages used (e.g. numpy, pandas, tensorflow): pyspark v3.1.1

hs_err_pid92789.log

bhupesh-simpledatalabs · Apr 28 '21 09:04

It is unusual that it works in the jep interactive session but not in your application. The biggest difference is that the interactive session uses a SharedInterpreter. Have you tried using a SharedInterpreter instead of a SubInterpreter?
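
For reference, the switch itself is small; a minimal sketch (not tested against pyspark) of what that would look like, with the interpreter closed when done:

val interp = new SharedInterpreter()
try {
  // A SharedInterpreter shares one Python namespace across the whole process,
  // instead of isolating modules per sub-interpreter.
  interp.exec("import pyspark")
} finally {
  interp.close()
}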

bsteffensmeier · Apr 28 '21 14:04

When I simply switch to SharedInterpreter, I get the error below. I haven't changed any other environment variables; the same environment setup was used with SubInterpreter.

<class 'ModuleNotFoundError'>: No module named 'py4j.protocol'
jep.JepException: <class 'ModuleNotFoundError'>: No module named 'py4j.protocol'
	at /usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/context.<module>(context.py:27)
	at /usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/__init__.<module>(__init__.py:51)
	at <string>.<module>(<string>:5)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)

My code with SharedInterpreter looks like this:

val interpreter = new SharedInterpreter()
interpreter.exec(
      """import site
        |import pyspark
        |from pyspark.sql import *
        |from pyspark.sql.functions import *
        |import threading
        |spark = SparkSession.builder.appName('name').master('local[1]').getOrCreate()
        |sc = spark.sparkContext  # sparkContext is a property, not a method
        |contextString = sc.getConf().toDebugString()
        |""".stripMargin)
print(interpreter.getValue("contextString"))
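
For what it's worth, one way to probe the import failure might be to make the py4j zip importable before pyspark is imported. A minimal sketch (zip path assumed from my PYTHONPATH above; I haven't verified that this works):

val probe = new SharedInterpreter()
try {
  probe.exec(
    """import sys
      |# py4j zip path copied from the PYTHONPATH above (assumption)
      |sys.path.append('/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip')
      |import py4j.protocol
      |""".stripMargin)
} finally {
  probe.close()
}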

bhupesh-simpledatalabs · Apr 28 '21 14:04

Closing due to inactivity.

bsteffensmeier · Nov 01 '22 21:11