spark-solr java.lang.ClassNotFoundException: solr.DefaultSource
Hi @kiranchitturi
I am using spark-solr 3.0.4 and Apache Spark 2.0.2 with Solr 7.3.0, and I am getting the above exception. The same exception persists with several other versions of the spark-solr connector. I have tried changing the Spark and Solr versions as per the recommendations at https://github.com/lucidworks/spark-solr, but nothing seems to work.
Below are the relevant dependencies from the POM:
<dependency>
    <groupId>com.lucidworks.spark</groupId>
    <artifactId>spark-solr</artifactId>
    <version>3.0.4</version>
</dependency>
<dependency>
    <groupId>com.sun</groupId>
    <artifactId>tools</artifactId>
    <version>1.8.0</version>
    <scope>system</scope>
    <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
<dependency>
    <groupId>jdk.tools</groupId>
    <artifactId>jdk.tools</artifactId>
    <version>1.8.0</version>
    <scope>system</scope>
    <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
Exception occurred:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: solr. Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
at com.ezest.spark.solr.sparkSolrConnTest.searchFromSolrToSpark(sparkSolrConnTest.java:37)
at com.ezest.spark.solr.sparkSolrConnTest.main(sparkSolrConnTest.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:736)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:185)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:210)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: solr.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:132)
Update: Please note that I am able to run it locally from Eclipse, with Solr installed on my Windows machine. The exception occurs while submitting it on a Hadoop cluster using:
bin/spark-submit --class <Project.Package>.sparkSolrConnTest /opt/sbt_jars/spark-solr-test-0.0.1-SNAPSHOT.jar --master local
Please check that the spark-solr jar is on the classpath or included in your shaded jar.
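For example, passing the shaded connector jar to spark-submit explicitly should make the data source resolvable (the jar path and version below are only placeholders for your environment, and note that options such as --master have to come before the application jar):

bin/spark-submit --master local \
  --jars /opt/sbt_jars/spark-solr-3.0.4-shaded.jar \
  --class <Project.Package>.sparkSolrConnTest \
  /opt/sbt_jars/spark-solr-test-0.0.1-SNAPSHOT.jar

Alternatively, bundle spark-solr and its dependencies into the application's own shaded/uber jar.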
Thanks @kiranchitturi. I got rid of the above exception by adding the spark-solr shaded jar to the spark-submit command. However, the job is not able to locate any of the Solr collections and throws an org.apache.solr.common.SolrException: Collection not found exception.
I am able to read data from these collections while executing the code from Eclipse, but not through spark-submit.
Please check if you have the right zkhost and check the logs
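For reference, a minimal read against SolrCloud looks like the sketch below; the zkhost connect string (including any chroot such as /solr) and the collection name are placeholders, and "Collection not found" usually means the zkhost value does not point at the ZooKeeper ensemble that actually hosts that collection.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SolrReadCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("solr-read-check").getOrCreate();
        // zkhost must be the full ZooKeeper connect string of the SolrCloud cluster,
        // including the chroot if one is configured (e.g. "/solr"); both values are placeholders.
        Dataset<Row> df = spark.read()
                .format("solr")
                .option("zkhost", "zkhost1:2181,zkhost2:2181,zkhost3:2181/solr")
                .option("collection", "my_collection")
                .load();
        System.out.println("Docs in collection: " + df.count());
        spark.stop();
    }
}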
csvDF.write.format("solr").options(options).mode(org.apache.spark.sql.SaveMode.Overwrite).save
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at com.lucidworks.spark.SolrRelation.insert(SolrRelation.scala:634)
at solr.DefaultSource.createRelation(DefaultSource.scala:27)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
... 49 elided
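The None.get inside SolrRelation.insert often points at a missing required write option, so it is worth verifying that the options map passed to .options(...) really contains both zkhost and collection. A minimal Java sketch of such a write, with placeholder values:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public final class SolrWriteSketch {
    // Writes an already-built DataFrame to Solr; both option values are placeholders.
    static void writeToSolr(Dataset<Row> df) {
        df.write()
          .format("solr")
          .option("zkhost", "zkhost1:2181,zkhost2:2181,zkhost3:2181/solr")
          .option("collection", "target_collection")
          .mode(SaveMode.Overwrite)
          .save();
    }
}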
I am getting the same as @rashid-1989. spark-submit's master is yarn in cluster mode, Spark 2.2.0.cloudera3 with com.lucidworks.spark:spark-solr:jar:3.3.4. The first microbatch works perfectly and writes into Solr, but from the second microbatch onward the driver throws ClassNotFound.
Hi Kiran, if you have it resolved please let me know what your solution was.
My action was forEachRDD(x -> .....write().format("solr").....)
-verbose:class shows that solr.DefaultSource was loaded from my app's uber jar (solr.DefaultSource is present in it; I decompiled it and confirmed that nothing is wrong with the class).
java.lang.ClassNotFoundException: Failed to find data source: solr. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:546)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:467)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at com.app1.app3.app4.app5.app6.writeToSolr(app6.java:236)
at com.app1.app3.app4.app5.app6.toDF(app6.java:119)
at com.app1.app3.app4.app5.app6.app7.execute(app7.java:275)
at com.app1.app3.app4.app5.app6.execute(app6.java:86)
at com.app1.app3.spark.app1.OSF.execute(OSF.java:176)
at com.app1.app3.spark.app3.OSF.call(OSF.java:121)
at com.app1.app3.spark.app2.OSF.call(OSF.java:59)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:280)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:280)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: solr.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:530)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22$$anonfun$apply$14.apply(DataSource.scala:530)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:530)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$22.apply(DataSource.scala:530)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:530)
... 43 more
Make sure the jar is present on the driver classpath. Check the logs for the driver classpath.
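In yarn cluster mode the driver runs inside the cluster, so the connector jar has to be shipped with the job rather than only sitting on the submitting machine. A sketch of such a submit command (jar paths, version and class name are placeholders):

bin/spark-submit --master yarn --deploy-mode cluster \
  --jars /path/to/spark-solr-3.3.4-shaded.jar \
  --class com.yourcompany.YourStreamingApp \
  /path/to/your-app.jar

--jars distributes the listed jars to both the driver and the executors; the driver classpath can then be checked in the application logs, e.g. via yarn logs -applicationId <appId>.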
Kiran, I'm trying to run a PySpark example and getting the same issue.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python basic example") \
    .getOrCreate()
df = spark.read.format("solr").option("collection", "system_history").load()
print("No. of docs in logs collection {}".format(df.count()))
spark.stop()
spark.sparkContext._jvm.java.lang.System.exit(0)
ERROR
File "/home/steph/Projects/fusion-spark-job-workbench/python_examples/count_docs.py", line 9, in <module>
df = spark.read.format("solr").option("collection", "system_history").load()
File "/home/steph/Projects/fusion/4.2.6/apps/spark-dist/python/pyspark/sql/readwriter.py", line 172, in load
return self._df(self._jreader.load())
File "/home/steph/Projects/fusion/4.2.6/apps/spark-dist/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/home/steph/Projects/fusion/4.2.6/apps/spark-dist/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/steph/Projects/fusion/4.2.6/apps/spark-dist/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: java.lang.ClassNotFoundException: Failed to find data source: solr. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:639)
The example works when I upload it to Fusion, but not when I debug it outside. Which library am I missing?
@svanschalkwyk did you manage to solve this?
I believe it was a jar which was not installed. Check the classpath and determine which spark jars need to be there.
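For the PySpark example above, the usual fix is to hand the connector to spark-submit when launching the script, either as a local jar or as Maven coordinates; the paths and version number below are placeholders:

bin/spark-submit --jars /path/to/spark-solr-3.6.0-shaded.jar count_docs.py

or

bin/spark-submit --packages com.lucidworks.spark:spark-solr:3.6.0 count_docs.py

Fusion ships the connector on its Spark classpath, which would explain why the same job works when uploaded there but not when run standalone.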