spark-sklearn

(Deprecated) Scikit-learn integration package for Apache Spark

Results 15 spark-sklearn issues

I'm running 15 combinations of a Logistic Regression model with spark-sklearn. I can see that all tasks have completed, but it then takes a huge amount of time to collect all...

Is there any plan to support scikit-learn >= 0.20.0?

Getting this test failure:

```
(spark_sklearn.converter_test.CSRVectorUDTTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/stoker/spark-sklearn/python/spark_sklearn/converter_test.py", line 83, in test_scipy_sparse
    self.assertEqual(df.count(), 1)
  File "/usr/local/spark/python/pyspark/sql/dataframe.py", line 522, in count
    return int(self._jdf.count())
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",...
```

The `best_params_` dict seems to be missing from `GridSearchCV`, even if refitting is enabled. [grid_search.py#L195](https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/grid_search.py#L195) refers to that attribute, and it is determined in [grid_search.py#L371](https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/grid_search.py#L371), but it is never actually exposed after fitting....
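A possible workaround, sketched under assumptions: if the fitted search object exposes per-candidate results shaped like scikit-learn's `cv_results_` dict (with `"params"` and `"mean_test_score"` keys, as in scikit-learn 0.18 and later), the best parameters can be recovered by hand. Whether spark-sklearn exposes such a dict in the version discussed here is an assumption. The helper name `best_params_from_cv_results` is hypothetical, and the scores below are made up for illustration.

```python
def best_params_from_cv_results(cv_results):
    """Return the parameter dict with the highest mean test score.

    `cv_results` is assumed to look like scikit-learn's `cv_results_`:
    parallel sequences under "params" and "mean_test_score".
    """
    scores = cv_results["mean_test_score"]
    params = cv_results["params"]
    # Index of the candidate with the largest mean score (first one on ties).
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return params[best_index]


# Illustrative, made-up scores (not real results).
example = {
    "params": [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}],
    "mean_test_score": [0.71, 0.84, 0.80],
}
print(best_params_from_cv_results(example))
```

With a real fitted object, one would pass its results dict instead of `example`, provided it has the assumed shape.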

bug

When I used `KeyedEstimator(sklearnEstimator=LinearRegression(), yCol="y")`, the error in the title occurred. The installed version of sklearn (0.19.2) meets the requirements, so why does this happen? Thank you.

The documentation for RandomizedSearchCV implies that a best_params_ property is available after .fit() is called. This does not appear to be the case. Here is the documentation in question: https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/random_search.py#L162...

At this line, it may be better to explicitly mention which parameters will be sampled with replacement if any one of them is a distribution: https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/random_search.py#L27 Are all parameters (those...
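For context, scikit-learn's documented rule for randomized search is that sampling is without replacement when every parameter is given as a list, and with replacement as soon as at least one parameter is a distribution. The toy sketch below illustrates that rule in plain Python. It is not the library's implementation: `sample_params` and `UniformDist` are hypothetical names, and capping `n_iter` at the grid size is a simplification chosen for this example.

```python
import itertools
import random


def sample_params(param_distributions, n_iter, seed=0):
    """Toy illustration of the sampling rule (not the library's code)."""
    rng = random.Random(seed)
    names = sorted(param_distributions)
    all_lists = all(
        isinstance(param_distributions[k], (list, tuple)) for k in names
    )
    if all_lists:
        # Every parameter is a finite list: draw distinct grid points,
        # i.e. sample without replacement.
        grid = list(itertools.product(*(param_distributions[k] for k in names)))
        chosen = rng.sample(grid, min(n_iter, len(grid)))
        return [dict(zip(names, combo)) for combo in chosen]
    # At least one parameter is a distribution: draw every value
    # independently, so a combination may repeat (with replacement).
    samples = []
    for _ in range(n_iter):
        point = {}
        for k in names:
            v = param_distributions[k]
            point[k] = v.rvs(rng) if hasattr(v, "rvs") else rng.choice(list(v))
        samples.append(point)
    return samples


class UniformDist:
    """Hypothetical stand-in for a scipy.stats-style distribution."""

    def __init__(self, low, high):
        self.low, self.high = low, high

    def rvs(self, rng):
        return rng.uniform(self.low, self.high)


grid_only = sample_params({"C": [0.1, 1, 10], "penalty": ["l1", "l2"]}, 6)
mixed = sample_params({"C": UniformDist(0, 1), "penalty": ["l1", "l2"]}, 5)
```

In the list-only case the six draws cover all six grid points exactly once. In the mixed case, list-valued parameters such as `penalty` are drawn uniformly on every iteration and may repeat.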

Currently, KeyedModel fitting in KeyedEstimator._fit is implemented by generating an array of a single serialized estimator, requiring an additional pass over the resulting dataframe which deserializes the UDT. This is...

enhancement

Using the current head 0.2.0 release of spark-sklearn and the current release of scikit-learn (0.18.1), I'm getting the following deprecation warning: /.../python3.4/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18...

enhancement

I found this incredibly convenient for creating small dataframes. Here is how you can use it:

```python
n = 5
A = rd.rand(n,4)
C = rd.randint(10, size=n)
df =...
```

enhancement