Add support for init_spark from an existing SparkSession?

Open zhh210 opened this issue 5 months ago • 0 comments

Is it possible to initialize the Spark object from an existing SparkSession? My work environment requires a customized SparkSession that is wrapped with complicated corporate credentials and setup. Running init_spark() as in the raydp example won't work because it is not aware of them. I can create a SparkSession object using the customized wrapper, but I don't know how to pass it to raydp.

The raydp example using standard Spark:

import ray
import raydp

# connect to ray cluster
ray.init(address='auto')

# create a Spark cluster with specified resource requirements
spark = raydp.init_spark(app_name='RayDP Example',
                         num_executors=2,
                         executor_cores=2,
                         executor_memory='4GB')

# normal data processing with Spark
df = spark.createDataFrame([('look',), ('spark',), ('tutorial',), ('spark',), ('look',), ('python',)], ['word'])
df.show()
word_count = df.groupBy('word').count()
word_count.show()

# stop the spark cluster
raydp.stop_spark()

Proposed raydp usage with an existing SparkSession:

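# get_customized_ss() is my corporate wrapper; it returns a ready SparkSession
spark_session = get_customized_ss()
# proposed call, not an existing raydp API
spark = raydp.init_spark(spark_session)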
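
To make the request more concrete, here is a rough sketch of the calling convention I have in mind. Everything in it is hypothetical: `init_spark_from`, its `session` and `fallback` parameters, and the `FakeSession` stand-in are names made up for illustration, not part of raydp or PySpark. It only shows the "reuse the session if one is passed, otherwise build one" behavior. It does not address how the executors of an externally created session would be scheduled on Ray.

```python
from typing import Any, Callable, Optional

# Handle to the session currently in use, so a stop call can tear it down.
_active_session: Optional[Any] = None


def init_spark_from(session: Optional[Any] = None,
                    fallback: Optional[Callable[[], Any]] = None) -> Any:
    """Hypothetical entry point: reuse `session` if given, else call `fallback`."""
    global _active_session
    if session is None:
        if fallback is None:
            raise ValueError("pass an existing session or a fallback factory")
        session = fallback()
    _active_session = session
    return session


def stop_spark() -> None:
    """Stop whichever session was registered, whether passed in or created."""
    global _active_session
    if _active_session is not None:
        _active_session.stop()
        _active_session = None


class FakeSession:
    """Stand-in for a SparkSession so the sketch runs without Spark installed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


if __name__ == "__main__":
    corporate = FakeSession("corporate")   # would come from get_customized_ss()
    spark = init_spark_from(corporate)     # the existing object is reused as-is
    print(spark is corporate)
    stop_spark()
    print(corporate.stopped)
```

With a real SparkSession in place of `FakeSession`, the caller would keep using `spark` exactly as in the standard example above.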

zhh210 • Sep 24 '24 14:09