sparkit-learn
ValueError: This solver needs samples of at least 2 classes in the data
Hi,
I am using SparkLinearSVC. The code is as follows:
svm_model = SparkLinearSVC(class_weight='auto')
svm_fitted = svm_model.fit(train_Z, classes=np.unique(train_y))
and I get the following error:
File "/DATA/sdw1/hadoop/yarn/local/usercache/ad79139/filecache/328/spark-assembly-1.2.1.2.2.4.2-2-hadoop2.6.0.2.2.4.2-2.jar/pyspark/worker.py", line 98, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
return func(split, prev_func(split, iterator))
File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 258, in func
return f(iterator)
File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 820, in <lambda>
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
File "/usr/lib/python2.6/site-packages/splearn/linear_model/base.py", line 81, in <lambda>
mapper = lambda X_y: super(cls, self).fit(
File "/usr/lib64/python2.6/site-packages/sklearn/svm/classes.py", line 207, in fit
self.loss
File "/usr/lib64/python2.6/site-packages/sklearn/svm/base.py", line 809, in _fit_liblinear
" class: %r" % classes_[0])
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
However, I do have 2 classes, namely 0 and 1. The block size of the DictRDD is 2000, and the percentages of classes 0 and 1 are 92% and 8%, respectively.
Sadly, this is indeed a bug. Sparkit trains sklearn's linear models in parallel on each block and then averages them in a reduce step. At least one of your blocks contains only one of the labels. To check, count the blocks that hold fewer than two distinct labels:
train_Z[:, 'y']._rdd.map(lambda x: np.unique(x).size).filter(lambda x: x < 2).count()
To work around it, you could shuffle the training data so that no block ends up with a single label, but this is still waiting for a cleaner solution.
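For illustration, a minimal sketch of that shuffle, assuming the data is still available locally as numpy arrays (train_X is a hypothetical name for the feature matrix behind train_Z), sc is a live SparkContext, and the DictRDD constructor shown in the project README:

import numpy as np
from splearn.rdd import DictRDD

# Shuffle the rows before blocking so every block of 2000 samples is very
# likely to contain both labels (train_X / train_y are assumed local arrays).
perm = np.random.permutation(len(train_y))
X_rdd = sc.parallelize(train_X[perm], 4)
y_rdd = sc.parallelize(train_y[perm], 4)
train_Z = DictRDD((X_rdd, y_rdd), columns=('X', 'y'), bsize=2000)

# Re-run the check above; it should now report 0 single-label blocks.
print(train_Z[:, 'y']._rdd.map(lambda x: np.unique(x).size).filter(lambda x: x < 2).count())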
Thanks
I believe I found a workaround for this. Since these problems tend to occur with highly imbalanced datasets, I would suggest using StratifiedShuffleSplit and varying the train_size (or test_size) ratio, as shown below:
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

# Stratified splitting keeps both classes represented in every partition.
for trainRatio in np.arange(0.05, 1, 0.05):
    split = StratifiedShuffleSplit(n_splits=2, train_size=trainRatio)
    for trainIdx, testIdx in split.split(X, y):
        Xtrain, Xtest = X[trainIdx], X[testIdx]
        ytrain, ytest = y[trainIdx], y[testIdx]
        model = someModel()  # placeholder for the estimator being evaluated
        model.fit(Xtrain, ytrain)
        pred = model.predict(Xtest)
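A note on why this also helps with the blocking issue: in the sklearn versions I have used, StratifiedShuffleSplit returns the selected indices in shuffled order, so re-blocking the resulting Xtrain/ytrain into a DictRDD (as in the sketch above) makes single-label blocks much less likely while preserving the 92/8 class ratio in every split.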
Can't believe that this bug is still not fixed! Sad!