
Unable to create pyspark's SparkSession object by embedding Python code in a Scala application using Jep

bhupesh-simpledatalabs opened this issue · 3 comments

I am trying to run sample Python code inside Scala code using Jep. In my Python code I simply create a SparkSession object via "SparkSession.builder.appName('name').master('local[1]').getOrCreate()" and execute that Python code via Jep using a SubInterpreter. I have also added pyspark as a shared module in the JepConfig used to create the SubInterpreter instance. My entire Scala code looks like this:

val jepConfig = new JepConfig
jepConfig.addSharedModules("pyspark")
val interpreter = new SubInterpreter(jepConfig)
interpreter.exec(
      """import site
        |import pyspark
        |from pyspark.sql import *
        |from pyspark.sql.functions import *
        |import threading
        |spark = SparkSession.builder.appName('name').master('local[1]').getOrCreate()
        |sc = spark.sparkContext  # sparkContext is a property, not a method
        |contextString = sc.getConf().toDebugString()
        |""".stripMargin)
print(interpreter.getValue("contextString"))

I am passing the necessary environment variables, listed below, and jep.jar is on the classpath; a quick sanity check of what the embedded interpreter actually sees is sketched after the list.

-Djava.library.path=/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/jep/
PYSPARK_PYTHON=/usr/local/bin/python3
SPARK_HOME=/usr/local/Cellar/apache-spark/3.0.1
PYTHONPATH=/usr/local/Cellar/apache-spark/3.0.1/libexec/python:/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip
PATH=/usr/local/Cellar/apache-spark/3.0.1/bin:/Library/Frameworks/Python.framework/Versions/3.9/bin:/usr/local/opt/python@3.9/bin:/Users/bhupeshgoel/Documents/apache-maven-3.6.3/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
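
A quick way to confirm that the embedded interpreter actually picks up these PYTHONPATH entries is to dump sys.path from the same interpreter setup before importing pyspark. A minimal sketch (reusing the jepConfig from above; names are illustrative):

val check = new SubInterpreter(jepConfig)
try {
  // Join sys.path into a single string so it can be pulled back into Scala.
  check.exec("import sys\npathString = '\\n'.join(sys.path)")
  println(check.getValue("pathString"))
} finally {
  check.close()
}

Both Spark directories from the PYTHONPATH above should show up in the output; if they don't, the pyspark import in the code above cannot succeed.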

But still, when I run the above Scala code I get a segmentation fault:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x000000012d633bc9, pid=92789, tid=0x0000000000001603
#
# JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 1.8.0_275-b01)
# Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# C  [Python+0x7bbc9]  PyModule_GetState+0x9
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/bhupeshgoel/Documents/codebase/prophecy/hs_err_pid92789.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/AdoptOpenJDK/openjdk-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

The detailed error report file is also attached to this ticket.

I wanted to know whether pyspark is supported with Jep, especially when running pyspark code inside Scala/Java code. I was able to create a SparkSession instance in a Jep interactive session:

Bhupeshs-MacBook-Pro:~ bhupeshgoel$ jep
>>> import pyspark
>>> from pyspark.sql import *
>>> from pyspark.sql.functions import *
>>> spark = SparkSession.builder.appName('name').master("local[1]").getOrCreate()
21/04/28 13:03:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> lit(1)
Column<b'1'>

Other environment details:

  • OS Platform, Distribution, and Version: MacOS Catalina v10.15.7
  • Python Distribution and Version: python3.9
  • Java Distribution and Version: OpenJDK 1.8
  • Jep Version: 3.9.1
  • Python packages used (e.g. numpy, pandas, tensorflow): pyspark v3.1.1

hs_err_pid92789.log

bhupesh-simpledatalabs · Apr 28 '21 09:04

It is unusual that it works in the jep interactive session but not in your application. The biggest difference is that the interactive session uses a SharedInterpreter. Have you tried using a SharedInterpreter instead of a SubInterpreter?
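
For reference, the switch itself is small; a minimal sketch (not tested against pyspark) of what that would look like, with the interpreter closed when done:

val interp = new SharedInterpreter()
try {
  // A SharedInterpreter shares one Python namespace across the whole process,
  // instead of isolating modules per sub-interpreter.
  interp.exec("import pyspark")
} finally {
  interp.close()
}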

bsteffensmeier · Apr 28 '21 14:04

When I simply switch to SharedInterpreter, I get the error below. I haven't changed any other environment variables; the same environment setup was used with SubInterpreter.

<class 'ModuleNotFoundError'>: No module named 'py4j.protocol'
jep.JepException: <class 'ModuleNotFoundError'>: No module named 'py4j.protocol'
	at /usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/context.<module>(context.py:27)
	at /usr/local/Cellar/apache-spark/3.0.1/libexec/python/pyspark/__init__.<module>(__init__.py:51)
	at <string>.<module>(<string>:5)
	at jep.Jep.exec(Native Method)
	at jep.Jep.exec(Jep.java:478)

My code with SharedInterpreter looks like this:

val interpreter = new SharedInterpreter()
interpreter.exec(
      """import site
        |import pyspark
        |from pyspark.sql import *
        |from pyspark.sql.functions import *
        |import threading
        |spark = SparkSession.builder.appName('name').master('local[1]').getOrCreate()
        |sc = spark.sparkContext  # sparkContext is a property, not a method
        |contextString = sc.getConf().toDebugString()
        |""".stripMargin)
print(interpreter.getValue("contextString"))
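
For what it's worth, one way to probe the import failure might be to make the py4j zip importable before pyspark is imported. A minimal sketch (zip path assumed from my PYTHONPATH above; I haven't verified that this works):

val probe = new SharedInterpreter()
try {
  probe.exec(
    """import sys
      |# py4j zip path copied from the PYTHONPATH above (assumption)
      |sys.path.append('/usr/local/Cellar/apache-spark/3.0.1/libexec/python/lib/py4j-0.10.9-src.zip')
      |import py4j.protocol
      |""".stripMargin)
} finally {
  probe.close()
}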

bhupesh-simpledatalabs · Apr 28 '21 14:04

Closing due to inactivity.

bsteffensmeier · Nov 01 '22 21:11