spark-knn knn.fit(training) throws an exception

I followed the example as given:

    val training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
    val knn = new KNNClassifier()
      .setTopTreeSize(training.count().toInt / 500)
      .setK(10)

First error: TopTreeSize is invalid (0), since the training sample has only 100 rows. If we instead set TopTreeSize manually to 1, then knn.fit(training) throws an exception:

java.util.NoSuchElementException: Failed to find a default value for inputCols
    at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
    at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:652)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:651)
    at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
    at org.apache.spark.ml.param.Params$class.$(params.scala:658)
    at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
    at org.apache.spark.ml.knn.KNN.fit(KNN.scala:383)
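
On the first error in this post: training.count().toInt / 500 is integer division, so with 100 rows it evaluates to 0. A minimal sketch of a guard, assuming a floor of 1 is acceptable for a small dataset; safeTopTreeSize is a hypothetical helper for illustration, not part of spark-knn:

```scala
object TopTreeSizeSketch {
  // Hypothetical helper (not part of spark-knn): derive a top-tree size from
  // the row count, but never let it drop below 1.
  def safeTopTreeSize(rowCount: Long, divisor: Int = 500): Int = {
    require(divisor > 0, "divisor must be positive")
    math.max(1L, rowCount / divisor).toInt
  }

  def main(args: Array[String]): Unit = {
    // Integer division truncates toward zero, so 100 / 500 is 0.
    println(100 / 500)               // prints 0
    println(safeTopTreeSize(100L))   // prints 1
    println(safeTopTreeSize(60000L)) // prints 120
  }
}
```

The result could then be passed to setTopTreeSize in place of the raw division.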

Jan 09 '18 13:01 akshaybhatt14495

Hi, if you look at the example: https://github.com/saurfang/spark-knn/blob/master/spark-knn-examples/src/main/scala/com/github/saurfang/spark/ml/knn/examples/MNIST.scala

For the KNNClassifier object it sets two column names, i.e. the features and prediction columns:

    .setFeaturesCol("pcaFeatures")
    .setPredictionCol("predicted")

These seem to be missing in your case.

Jan 09 '18 14:01 kaushikacharya

@kaushikacharya Thanks for the response. Actually, I need the k nearest neighbors (KNN) themselves. For that, does the dataset need class labels (i.e. the first entry of each row being 0 or 1)?

Jan 10 '18 05:01 akshaybhatt14495

@kaushikacharya I'm talking about KNN.scala.

Jan 10 '18 05:01 akshaybhatt14495

Got another error in command knn.fit(training)

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
    at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
    at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)

Jan 10 '18 06:01 akshaybhatt14495

Which spark version are you using?

These might be helpful for resolving the ml vs mllib error:

https://stackoverflow.com/questions/38901123/how-convert-ml-vectorudt-features-from-mllib-to-ml-type

https://spark.apache.org/docs/2.1.0/ml-migration-guides.html "While most pipeline components support backward compatibility for loading, some existing DataFrames and pipelines in Spark versions prior to 2.0, that contain vector or matrix columns, may need to be migrated to the new spark.ml vector and matrix types. Utilities for converting DataFrame columns from spark.mllib.linalg to spark.ml.linalg types (and vice versa) can be found in spark.mllib.util.MLUtils."
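
The conversion utilities mentioned in the migration guide can be sketched as follows. This is only an illustration, assuming Spark 2.x on the classpath and the sample file path used earlier in the thread; toMLFeatures is a hypothetical helper name. It shows two ways to end up with a spark.ml vector in the features column: reading the file with the DataFrame "libsvm" reader, or keeping MLUtils.loadLibSVMFile and converting afterwards with MLUtils.convertVectorColumnsToML.

```scala
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.{DataFrame, SparkSession}

object ConvertVectorsSketch {
  // Hypothetical helper: convert an mllib "features" column to the ml vector type.
  def toMLFeatures(df: DataFrame): DataFrame =
    MLUtils.convertVectorColumnsToML(df, "features")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("convert-vectors-sketch")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    val path = "data/mllib/sample_libsvm_data.txt"

    // Option 1: the DataFrame reader yields a spark.ml vector column directly.
    val direct = spark.read.format("libsvm").load(path)

    // Option 2: keep the mllib loader from the original post, then convert.
    val legacy = MLUtils.loadLibSVMFile(spark.sparkContext, path).toDF()
    val converted = toMLFeatures(legacy)

    direct.printSchema()
    converted.printSchema()
    spark.stop()
  }
}
```

Either DataFrame should pass the schema check that rejected the mllib vector column; whether the rest of knn.fit then succeeds on this dataset is a separate question.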

Jan 10 '18 06:01 kaushikacharya

@kaushikacharya spark version is 2.2.0

Jan 10 '18 12:01 akshaybhatt14495

Have a look at https://github.com/saurfang/spark-knn/blob/master/project/Dependencies.scala:

    val sparktest = "org.apache.spark" %% "spark-core" % "2.1.0" % "test" classifier "tests"

Also, in build.sbt you can see commonSettings, which is defined in Common.scala. This mentions:

    sparkVersion := "2.1.0",

My understanding is that this repository targets Spark 2.1.0. Your use of 2.2.0 could be the reason for the errors you are facing.

Jan 10 '18 12:01 kaushikacharya

I changed my version and am now working with Spark 2.1.0, but I still got the same error.

Jan 11 '18 04:01 akshaybhatt14495

OK, I used the MLUtils function convertVectorColumnsFromML(training, "features") and then got a new error for the sample data given in sample_libsvm_data.txt:

java.lang.IllegalArgumentException: requirement failed: Sampling fraction (1.01) must be on interval [0, 1]
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.util.random.BernoulliSampler.<init>(RandomSampler.scala:147)
    at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:496)
    at org.apache.spark.rdd.RDD$$anonfun$sample$2.apply(RDD.scala:491)

Jan 11 '18 05:01 akshaybhatt14495

You are facing the same issue as: https://github.com/saurfang/spark-knn/issues/21

Your error says that: Sampling fraction (1.01) must be on interval [0, 1]

sampling fraction needs to be <= 1
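
To illustrate the constraint: sampling without replacement accepts only a fraction in [0, 1], so any fraction computed as "rows wanted / rows available" exceeds 1 when more rows are requested than the dataset holds. The sketch below is an assumption about the general mechanism, not a description of how spark-knn derives its fraction (the thread does not say); safeFraction is a hypothetical helper and the 101-of-100 figures are chosen only to reproduce the 1.01 in the message.

```scala
object SamplingFractionSketch {
  // Hypothetical helper: fraction for sampling without replacement,
  // clamped so it never exceeds 1.0.
  def safeFraction(requested: Long, total: Long): Double = {
    require(requested >= 0, "requested must be non-negative")
    require(total > 0, "total must be positive")
    math.min(1.0, requested.toDouble / total)
  }

  def main(args: Array[String]): Unit = {
    // Illustration only: asking for 101 rows out of 100 gives a fraction above 1.
    println(101.toDouble / 100)      // prints 1.01
    println(safeFraction(101L, 100L)) // prints 1.0
    println(safeFraction(50L, 100L))  // prints 0.5
  }
}
```

With only 100 rows in sample_libsvm_data.txt, a larger dataset sidesteps the problem without any clamping.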

I would suggest first running on the MNIST data (mnist.bz2) from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/ Put this data in your data folder and run the MNIST Scala example.

Jan 11 '18 06:01 kaushikacharya
