spookystuff
Sample example which does not work
The library looks interesting. I tried a simple example with a sample app, but I got the following error:
[error] (run-main-0) org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 3.0 failed 1 times, most recent failure: Lost task 5.0 in stage 3.0 (TID 29, localhost): java.lang.NullPointerException
[error] at com.tribbloids.spookystuff.utils.Utils$.uriSlash(Utils.scala:55)
[error] at com.tribbloids.spookystuff.utils.Utils$$anonfun$uriConcat$1.apply(Utils.scala:49)
[error] at com.tribbloids.spookystuff.utils.Utils$$anonfun$uriConcat$1.apply(Utils.scala:48)
[error] at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
[error] at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
[error] at com.tribbloids.spookystuff.utils.Utils$.uriConcat(Utils.scala:48)
[error] at com.tribbloids.spookystuff.pages.PageUtils$.autoRestore(PageUtils.scala:183)
[error] at com.tribbloids.spookystuff.actions.TraceView$$anonfun$4.apply(TraceView.scala:95)
[error] at com.tribbloids.spookystuff.actions.TraceView$$anonfun$4.apply(TraceView.scala:95)
[error] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[error] at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[error] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[error] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[error] at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[error] at scala.collection.AbstractTraversable.map(Traversable.scala:105)
[error] at com.tribbloids.spookystuff.actions.TraceView.fetchOnce(TraceView.scala:95)
[error] at com.tribbloids.spookystuff.actions.TraceView$$anonfun$2.apply(TraceView.scala:83)
[error] at com.tribbloids.spookystuff.actions.TraceView$$anonfun$2.apply(TraceView.scala:83)
[error] at scala.util.Try$.apply(Try.scala:161)
[error] at com.tribbloids.spookystuff.utils.Utils$.retry(Utils.scala:22)
[error] at com.tribbloids.spookystuff.actions.TraceView.fetch(TraceView.scala:82)
[error] at com.tribbloids.spookystuff.sparkbinding.PageRowRDD$$anonfun$26.apply(PageRowRDD.scala:491)
[error] at com.tribbloids.spookystuff.sparkbinding.PageRowRDD$$anonfun$26.apply(PageRowRDD.scala:490)
[error] at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[error] at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[error] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
[error] at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[error] at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[error] at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
[error] at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:99)
[error] at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
[error] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
[error] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:88)
[error] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:745)
The application is pretty simple (imports added for completeness):

import org.apache.spark.{SparkConf, SparkContext}
import com.tribbloids.spookystuff.SpookyContext
import com.tribbloids.spookystuff.actions._

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[*]").setAppName("Test")
    val sc = new SparkContext(conf)
    val spooky = new SpookyContext(sc)
    import spooky.dsl._
    val df = spooky
      .wget("https://news.google.com/?output=rss&q=barack%20obama")
      .join(S"item title".texts)(
        Wget(x"http://api.mymemory.translated.net/get?q=${'A}&langpair=en|fr")
      )('A ~ 'title, S"translatedText".text ~ 'translated)
      .toDF()
    val csv = df.toCSV()
    csv.foreach(println)
  }
}
Do you have any ideas?
Would you be interested in help with the library's development?
Hello, I have a similar issue. I tried to run the sample app, but I got the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.Accumulator.<init>(Ljava/lang/Object;Lorg/apache/spark/AccumulatorParam;Lscala/Option;)V
at com.tribbloids.spookystuff.Metrics$.accumulator(SpookyContext.scala:20)
at com.tribbloids.spookystuff.Metrics$.$lessinit$greater$default$1(SpookyContext.scala:25)
at com.tribbloids.spookystuff.SpookyContext.<init>(SpookyContext.scala:68)
at com.tribbloids.spookystuff.SpookyContext.<init>(SpookyContext.scala:72)
at FTest$.main(FTest.scala:15)
at FTest.main(FTest.scala)
16/08/18 11:48:16 INFO SparkContext: Invoking stop() from shutdown hook
16/08/18 11:48:16 INFO SparkUI: Stopped Spark web UI at http://127.0.1.1:4040
The application has the following code (imports added for completeness):

import org.apache.spark.{SparkConf, SparkContext}
import com.tribbloids.spookystuff.SpookyContext
import com.tribbloids.spookystuff.actions._

object FTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    assert(sc.parallelize(1 to 100).reduce(_ + _) == 5050)
    val spooky = new SpookyContext(sc)
    import spooky.dsl._
    spooky
      .wget("https://news.google.com/?output=rss&q=barack%20obama")
      .join(S"item title".texts)(
        Wget(x"http://api.mymemory.translated.net/get?q=${'A}&langpair=en|fr")
      )('A ~ 'title, S"translatedText".text ~ 'translated)
      .toDF()
  }
}
Could it be because of a wrong configuration? Furthermore, I loaded all the jar files I need into my IDE, so Spark should be working. Can you help me or give me a hint as to why this error occurred?
Thanks in advance.
Hello @DominikRoy, this looks like a version incompatibility to me. Use the Spark dependency versions supported by your spookystuff version.
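As a quick sanity check (a generic sketch, not spookystuff API), you can print the Scala version of the shell you are running to confirm it matches the version your dependencies were built against:

```scala
// Print the Scala version of the running shell/JVM, e.g. "2.10.5".
// Inside spark-shell you can additionally run `println(sc.version)`
// to confirm the Spark version actually on the classpath.
println(scala.util.Properties.versionNumberString)
```

Comparing these against the versions your build pulls in is usually the fastest way to spot the mismatch behind a NoSuchMethodError.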
I'm getting the same error, and I'm wondering what version of Spark I should be using? I don't see this specified in the documentation.
Currently I'm trying Spark 1.6.2 with Scala 2.10.5, using com.tribbloids.spookystuff:spookystuff-core:0.3.2.
I have also tried with Spark 2.1.1 (Scala 2.11), but that broke even sooner.
What version of Spark works?
I also get the same error with Spark 1.5.1.
Can you use the master branch (0.4.0-SNAPSHOT) by compiling it on your computer? Sorry about releasing so slowly; some other components are still not close to feature freeze.
0.3.2 is very old and no longer maintained. Yours, Peng
What version of Spark would you recommend I use with 0.4.0-SNAPSHOT?
Also, should I just be adding it via
spark-shell --jars spookystuff-core-0.4.0-SNAPSHOT.jar, for example, or do I need to include more?
I currently attempted this (including spookystuff-core-0.4.0-SNAPSHOT.jar) on Spark 1.5.1 and 1.6.2, and I get an error when attempting this:
import com.tribbloids.spookystuff.actions._
import com.tribbloids.spookystuff.dsl._
import com.tribbloids.spookystuff.SpookyContext
//this is the entry point of all queries & configurations
val spooky = SpookyContext(sc)
errors with:
error: bad symbolic reference. A signature in AbstractConf.class refers to term dsl
in package org.apache.spark.ml which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling AbstractConf.class.
error: bad symbolic reference. A signature in AbstractConf.class refers to term utils
in value org.apache.spark.ml.dsl which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling AbstractConf.class.
<console>:36: error: bad symbolic reference. A signature in AbstractConf.class refers to term messaging
in value org.apache.spark.ml.utils which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling AbstractConf.class.
val spooky = SpookyContext(sc)
org.apache.spark.ml is present, but I'm not sure why it's expecting org.apache.spark.ml.dsl to exist
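One generic way to check whether a given package's classes actually made it onto the classpath (a sketch, not spookystuff API; the class names probed are only examples) is to probe with Class.forName:

```scala
// Returns true if the named class can be loaded from the current classpath.
// Useful for checking whether the jar passed to --jars actually contains
// the classes the compiler is complaining about.
def onClasspath(className: String): Boolean =
  try {
    Class.forName(className)
    true
  } catch {
    case _: ClassNotFoundException => false
  }

// Probe a class that certainly exists, and one that does not.
println(onClasspath("java.lang.String"))   // true
println(onClasspath("org.example.NoSuch")) // false
```

If a probe like this returns false for the classes mentioned in the "bad symbolic reference" errors, the jar on the classpath simply doesn't contain them, which points at loading the wrong artifact rather than a version conflict.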
The version for 0.4.0-SNAPSHOT is Spark 1.6.3. Yours, Peng
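For reference, if you build with sbt, a dependency block matching that combination might look like the following. Only the Spark 1.6.3 version comes from this thread; the Scala patch version and the exact spookystuff coordinates are assumptions for illustration:

```scala
// build.sbt sketch -- versions per the advice above; coordinates assumed.
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.3" % "provided",
  "com.tribbloids.spookystuff" %% "spookystuff-core" % "0.4.0-SNAPSHOT"
)
```

Marking Spark as "provided" avoids bundling a second, possibly conflicting Spark into your application jar.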
So, after some guesswork, it looks like I should probably be including spookystuff-assembly-0.4.0-SNAPSHOT-spark1.6.jar, which I'm doing now.
Currently I've got this happening:
java.lang.UnsupportedClassVersionError: com/tribbloids/spookystuff/session/python/PythonProcess : Unsupported major.minor version 52.0
I'm guessing it's got something to do with Java/py4j version inconsistencies.
I think it's just the Java version (you are using Java 7, but the jar was compiled for Java 8). py4j shouldn't be in my dependency list. Yours, Peng
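For context (my addition, not from the thread): a class-file major version maps to the minimum Java release as major minus 44, so "major.minor version 52.0" means the class requires at least Java 8 and cannot load on a Java 7 JVM:

```scala
// Class-file major version N corresponds to Java release (N - 44),
// e.g. 52 -> Java 8, 51 -> Java 7, 50 -> Java 6.
def requiredJavaRelease(classFileMajor: Int): Int = classFileMajor - 44

println(requiredJavaRelease(52)) // 8: a version-52 class won't run on Java 7
println(requiredJavaRelease(51)) // 7
```

So the fix is to run spark-shell on a Java 8 JVM rather than to change any py4j dependency.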
... or install it into your local Maven repository using mvn install. Yours, Peng
Thanks for the advice; I was able to get it functioning on the cluster without version errors now.
I'm a little new to the library's syntax, and it seems the quickstart example is a little out of date for 0.4.0-SNAPSHOT.
When executing this:
spooky.wget("https://news.google.com/?output=rss&q=barack%20obama").join(S"item title".texts){
Wget(x"http://api.mymemory.translated.net/get?q=${'A}&langpair=en|fr")
}('A ~ 'title, S"translatedText".text ~ 'translated).toDF()
I get this error:
error: com.tribbloids.spookystuff.rdd.FetchedDataset does not take parameters
}('A ~ 'title, S"translatedText".text ~ 'translated).toDF()
Are there any quickstart examples that work for 0.4.0-SNAPSHOT that I can take a look at?
Yeah, a lot has changed; can't help it, better algorithms keep popping up all the time.
I recommend referring to the test cases in the integration submodule.
They serve as short examples that crawl this dummy website: http://webscraper.io/test-sites. Yours, Peng
It's been 13 days; should I close it?
org.apache.spark.ml is present, but I'm not sure why it's expecting org.apache.spark.ml.dsl to exist
After checking the source code, I see that org.apache.spark.ml.dsl is a package contained in the mldsl/ directory of the project.
You can publish the source code to your local repository and include it in your spark-shell:
~/.m2/repository/com/tribbloids/spookystuff/spookystuff-mldsl/0.7.0-SNAPSHOT/spookystuff-mldsl-0.7.0-SNAPSHOT.jar
The mldsl module should be published to a Maven repository and added as a dependency on the documentation page! spookystuff is an interesting project, but if the getting-started example doesn't work, it can dissuade many developers from using it. It is urgent to update the documentation website. Where can we modify the documentation page? @tribbloid