Using RayDP with SparkNLP raises an exception when trying to view Spark DataFrame columns
Hello. I am using RayDP to start a Spark cluster and then run a pipeline from John Snow Labs SparkNLP. I hit an issue when I try to view the Spark DataFrame: sparkdf.show() fails for any complex data type, although I can view a plain string column from the same DataFrame.
The exception is: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.sql.catalyst.expressions.objects.LambdaVariable.accessor of type scala.Function2 in instance of org.apache.spark.sql.catalyst.expressions.objects.LambdaVariable
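The failing pattern, roughly, looks like this (a minimal sketch; the pretrained pipeline name and sample text are illustrative, not my exact code):

from sparknlp.pretrained import PretrainedPipeline

# `spark` is the session returned by raydp.init_spark (config below)
pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")

sparkdf = spark.createDataFrame([["Google was founded in 1998."]]).toDF("text")
result = pipeline.transform(sparkdf)

result.select("text").show()      # plain string column: works
result.select("entities").show()  # complex array<struct> column: throws the SerializedLambda error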
Starting Spark:

import raydp

spark = raydp.init_spark(
    app_name="RaySparkSKLearn2",
    num_executors=2,
    executor_cores=1,
    executor_memory="2GB",
    configs={
        "spark.driver.extraClassPath": "/home/sa-prd-itx-aic-srvc/.ivy2/jars",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max": "100M",
        "spark.driver.maxResultSize": "0",
        "spark.driver.memory": "16G",
        "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1",
    },
)
Reference code from John Snow Labs: https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb
I changed it to use the Spark session started from RayDP. Would you have any ideas on what could be going wrong here? Thanks
Hello, glad you tried RayDP. I'm not sure, but I suspect the exception is not directly caused by RayDP. What version of Spark are you using? The notebook you provided says it uses Spark 3.1.2, but RayDP 0.4.1 requires Spark >= 3.2.0.
Have you tried the same code with a traditional Spark cluster? Does it work?
Thank you so much for your reply. The Spark version used is 3.2.1. I don't think it is caused by RayDP; it seems to be some config that is missing when starting up Spark, so it looks like an issue with the SparkNLP library. I have raised an issue with them too. I was checking whether you have any leads when you see this kind of exception.
Thanks a lot for your time
I see. My suggestion is to try it on a non-RayDP Spark cluster, like local mode or standalone. If it works there but not on RayDP, we'll check why it fails.
Thanks
Yes, I tried it on a non-RayDP cluster with the following (in RayDP I used the same configs), and there the pipeline worked fine:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[4]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1") \
    .getOrCreate()
The only difference I see is that while defining the pipeline, it prints a message about training on the device. The same piece of code on a RayDP cluster doesn't print anything about training.
The code I follow is here (I just replaced sparknlp.start() with the SparkSession code above): https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/NER_EN.ipynb
@SrilekhaIG, thanks for the feedback. I suspect the SparkNLP jar file is not properly included in the driver/executor classpath when running on RayDP. I'll take a look. Your use case sounds interesting; I am curious why you want to run SparkNLP on Ray and how it helps.
I add the SparkNLP jars while starting Spark with RayDP via the following: configs={"spark.driver.extraClassPath": "/home/sa-prd-itx-aic-srvc/.ivy2/jars", ...}. If this is not set, I get the ClassNotFound exceptions.
We are using SparkNLP in one of our projects, and also Ray for implementing a distributed cluster for our ML programs. We need to combine the two into a seamless process; that's where this use case comes up.
@SrilekhaIG, there is a problem with the executor classpath when using spark.jars.packages in RayDP. As a workaround, can you please try installing the latest raydp-nightly and setting "raydp.executor.extraClassPath" to the path that contains the SparkNLP jars? I was able to run the example you provided after setting this, although I didn't reproduce the exact error you had. Without setting it, I got the ClassNotFound exceptions even when I set spark.driver.extraClassPath.
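In code, the workaround might look roughly like this (a sketch reusing the jar path from your configs; the app name and executor sizing are illustrative, so adjust them to your setup):

import raydp

# Workaround sketch: point the executors at the directory containing the
# SparkNLP jars via RayDP's own config key, alongside the driver classpath.
spark = raydp.init_spark(
    app_name="RaySparkNLP",
    num_executors=2,
    executor_cores=1,
    executor_memory="2GB",
    configs={
        "spark.driver.extraClassPath": "/home/sa-prd-itx-aic-srvc/.ivy2/jars",
        "raydp.executor.extraClassPath": "/home/sa-prd-itx-aic-srvc/.ivy2/jars",
    },
)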
Thank you so much for checking this out. It works with the property you mentioned. I think spark.driver.extraClassPath is not needed once I use raydp.executor.extraClassPath. Is this a permanent solution, or a workaround for the time being?
@SrilekhaIG Glad to see the problem is solved. This is indeed a workaround; we'll look into why spark.jars.packages fails in RayDP. But you can use raydp.executor.extraClassPath for now.
If it helps, I have raised an issue on the SparkNLP GitHub. The link to the issue is: https://github.com/JohnSnowLabs/spark-nlp/issues/7003