Using RayDP with Spark NLP gives an exception when trying to view the Spark DataFrame columns

Open SrilekhaIG opened this issue 2 years ago • 10 comments

Hello. I am using RayDP to start a Spark cluster and then run a pipeline from John Snow Labs Spark NLP. I get an error when I try to view the Spark DataFrame: sparkdf.show() fails for any column with a complex data type, although I can display a string column from the same DataFrame.

The exception is: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.objects.LambdaVariable.accessor of type scala.Function2 in instance of org.apache.spark.sql.catalyst.expressions.objects.LambdaVariable

start spark:

spark = raydp.init_spark(
    app_name='RaySparkSKLearn2',
    num_executors=2,
    executor_cores=1,
    executor_memory='2GB',
    configs={
        "spark.driver.extraClassPath": "/home/sa-prd-itx-aic-srvc/.ivy2/jars",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max": "100M",
        "spark.driver.maxResultSize": "0",
        "spark.driver.memory": "16G",
        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1",
    },
)

Reference code from John Snow Labs: https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb

I changed it to use the Spark session started from RayDP. Do you have any idea what could be going wrong here? Thanks.

SrilekhaIG avatar Mar 02 '22 10:03 SrilekhaIG

Hello, glad you tried RayDP. I'm not sure, but I guess the exception is not directly caused by RayDP. What version of Spark are you using? The notebook you provided says it uses Spark 3.1.2, but RayDP 0.4.1 requires Spark >= 3.2.0.

Have you tried the same code with a traditional Spark cluster? Does it work?

kira-lin avatar Mar 03 '22 02:03 kira-lin

Thank you so much for your reply. The Spark version used is 3.2.1. I don't think it is caused by RayDP. It seems that some config is missing when starting up Spark, and it looks like an issue with the Spark NLP library. I have raised an issue with them too. I was checking whether you have any leads when you see these kinds of exceptions.

Thanks a lot for your time.

SrilekhaIG avatar Mar 03 '22 07:03 SrilekhaIG

I see. My suggestion is that you can try it on a non-raydp spark cluster, like local mode or standalone. If it works on those but not raydp, we'll check why it fails.

Thanks

kira-lin avatar Mar 03 '22 07:03 kira-lin

Yes, I tried it on a non-RayDP cluster with the following session, and the pipeline worked fine there. In RayDP I used the same configs.

from pyspark.sql import SparkSession

# Parentheses let the builder chain span multiple lines.
spark = (
    SparkSession.builder
    .appName("Spark NLP")
    .master("local[4]")
    .config("spark.driver.memory", "16G")
    .config("spark.driver.maxResultSize", "0")
    .config("spark.kryoserializer.buffer.max", "2000M")
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1")
    .getOrCreate()
)

The only difference I see is that, while defining the pipeline, the local session prints a message about training on the device. The same piece of code in a RayDP cluster doesn't print anything about training.

Code that I follow is (just replaced sparknlp.start() with the above sparksession code): https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb

SrilekhaIG avatar Mar 03 '22 08:03 SrilekhaIG

@SrilekhaIG, thanks for the feedback. I suspect the Spark NLP jar file is not properly included in the driver/executor classpath when running on RayDP. I'll take a look. Your use case sounds interesting. I am curious why you want to run Spark NLP on Ray and how it helps?

carsonwang avatar Mar 03 '22 08:03 carsonwang

I add the Spark NLP jars when starting Spark with RayDP using the following: configs={"spark.driver.extraClassPath":"/home/sa-prd-itx-aic-srvc/.ivy2/jars",.... Without this, I get ClassNotFound exceptions.

We are using Spark NLP in one of our projects, and we also use Ray to implement a distributed cluster for our ML programs. We need to combine the two into one seamless process, and that is where this use case comes from.

SrilekhaIG avatar Mar 03 '22 10:03 SrilekhaIG
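
Before pointing a classpath setting at a directory, it can help to confirm that the Spark NLP jar is actually there. The sketch below is only a local sanity check, not part of RayDP or Spark NLP: the helper name find_jars and the "*spark-nlp*.jar" file name pattern are assumptions made for illustration, and the directory to check is whatever path you pass as the classpath.

```python
from pathlib import Path


def find_jars(jar_dir, pattern="*spark-nlp*.jar"):
    """Return the sorted names of jar files in jar_dir that match pattern.

    Hypothetical helper: the default pattern is an assumption about how
    the Spark NLP jar is named on disk, so adjust it to your setup.
    """
    directory = Path(jar_dir).expanduser()
    if not directory.is_dir():
        # A missing directory simply means no jars were found.
        return []
    return sorted(p.name for p in directory.glob(pattern))
```

An empty result for the directory used in the configs would suggest the classpath entry points at the wrong place.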

@SrilekhaIG, there is a problem with the executor classpath when using spark.jars.packages in RayDP. As a workaround, can you please try installing the latest raydp-nightly and setting "raydp.executor.extraClassPath" to the path that contains the Spark NLP jars? I was able to run the example you provided after setting this, although I did not reproduce the exact error you had. Without this setting, I got ClassNotFound exceptions even when I set spark.driver.extraClassPath.

carsonwang avatar Mar 07 '22 10:03 carsonwang
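
A minimal sketch of the workaround described above, assuming raydp-nightly and ray are installed. The other config values are copied from the original report. The helper names build_configs and start_spark are hypothetical, jar_dir stands for whatever directory holds the Spark NLP jars on your machine, and the explicit ray.init() guard is an assumption added for safety rather than something the thread shows.

```python
def build_configs(jar_dir):
    """Build the Spark configs from the report, plus the executor classpath workaround."""
    return {
        # Workaround from this thread: point RayDP executors at the Spark NLP jars.
        "raydp.executor.extraClassPath": jar_dir,
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max": "100M",
        "spark.driver.maxResultSize": "0",
        "spark.driver.memory": "16G",
        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1",
    }


def start_spark(jar_dir):
    """Start Spark on Ray with the configs above (hypothetical wrapper)."""
    # Imported lazily so build_configs can be used without ray/raydp installed.
    import ray
    import raydp

    # Assumption: start a local Ray instance if none is running yet.
    if not ray.is_initialized():
        ray.init()
    return raydp.init_spark(
        app_name="RaySparkSKLearn2",
        num_executors=2,
        executor_cores=1,
        executor_memory="2GB",
        configs=build_configs(jar_dir),
    )
```

Keeping the config dictionary in its own function makes it easy to compare against the configs used for the local-mode session.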

Thank you so much for checking this out. It works with the property that you mentioned. I think spark.driver.extraClassPath is not needed once raydp.executor.extraClassPath is set. Is this a permanent solution, or a workaround for the time being?

SrilekhaIG avatar Mar 08 '22 06:03 SrilekhaIG

@SrilekhaIG Glad to see the problem is solved. This is indeed a workaround; we'll look into why spark.jars.packages fails in RayDP. You can use raydp.executor.extraClassPath for now.

kira-lin avatar Mar 08 '22 08:03 kira-lin

If it helps, I have also raised an issue in the Spark NLP repository. The link to the issue is: https://github.com/JohnSnowLabs/spark-nlp/issues/7003

SrilekhaIG avatar Mar 08 '22 10:03 SrilekhaIG