dask-ml
[WIP] Fix empty partition prediction with ParallelPostFit
This PR fixes https://github.com/dask/dask-ml/issues/911
Thanks @VibhuJawa. When looking at the traceback in #911, I see that sklearn's check_array takes an ensure_min_samples parameter. If we pass ensure_min_samples=False there, does stuff go through properly?
Thanks for reviewing the issue, Tom.
When looking at the traceback in https://github.com/dask/dask-ml/issues/911, I see that sklearn's check_array takes an ensure_min_samples parameter. If we pass ensure_min_samples=False there, does stuff go through properly?
So I tried exploring a clean way to expose the self._validate_data parameter but came up with nothing. Open to any ideas on that front.
I think the problem is that each family of models in sklearn calls it with different parameters (see below), and I am also not sure that all models would just work even if we could somehow coerce them to accept the parameter (see related discussion here).
Please let me know if the approach I am taking in this PR is not feasible. I will try to explore other ways to go about solving this problem.
@TomAugspurger - Any further feedback or guidance here? I don't see a way to expose check_array's kwargs without changes to sklearn. IMO, it seems reasonable to handle this case in dask-ml.
cc @jrbourbeau @betatim for vis
I think what is happening is that predict() (and co) are being called with an empty input that contains zero samples. It seems sensible for the scikit-learn estimators to consider that an error. So I agree with Mike that this is something that dask-ml should handle, probably by not calling predict(), transform() etc. when a partition is empty and instead returning whatever is the expected result of making a prediction on an empty array (I'm not sure what this should be: None, or also an empty array?).
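A minimal sketch of that guard, under stated assumptions: `StrictEstimator` and `predict_partition` are hypothetical names (not dask-ml or scikit-learn API), and the empty result is assumed to be a 1-D float array. The real dtype and shape would depend on the estimator.

```python
import numpy as np


class StrictEstimator:
    """Hypothetical stand-in for an estimator that rejects zero-sample input."""

    def predict(self, X):
        if X.shape[0] == 0:
            raise ValueError("Found array with 0 sample(s)")
        # Toy prediction: sum of the features of each row.
        return X.sum(axis=1)


def predict_partition(part, estimator, empty_dtype=np.float64):
    """Call estimator.predict unless the partition has no rows."""
    if part.shape[0] == 0:
        # Assumption: predictions are 1-D, so the empty result has shape (0,).
        return np.empty((0,), dtype=empty_dtype)
    return estimator.predict(part)


est = StrictEstimator()
full = predict_partition(np.ones((4, 3)), est)   # estimator is called
empty = predict_partition(np.ones((0, 3)), est)  # estimator is skipped
```

The estimator never sees the empty partition, so its own input validation does not need to change.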
instead returning whatever is the expected result of making a prediction on an empty array (I'm not sure what this should be: None, or also an empty array?).
In this PR:
- If the output is supposed to be an array, I return an empty array (for both sparse and dense arrays).
- If the output is supposed to be a dataframe (or dataframe-like object), I return an empty dataframe-like object.
https://github.com/dask/dask-ml/blob/28b97e0e903e09b3af4ee3c1dbbeadc7b0b57915/dask_ml/wrappers.py#L661-L688
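The array and dataframe cases could be sketched roughly as follows. This is an illustration, not the code at the linked lines: `empty_like_meta` is a hypothetical helper, and `output_meta` is assumed to be a small example of the expected output. Only dense NumPy arrays and `.iloc`-style dataframe-like objects are covered; sparse matrices are left out.

```python
import numpy as np


def empty_like_meta(output_meta):
    """Return a zero-row object shaped like output_meta (hypothetical helper)."""
    if hasattr(output_meta, "iloc"):
        # Dataframe-like (e.g. pandas or cuDF): an empty slice keeps the dtypes.
        return output_meta.iloc[:0]
    if isinstance(output_meta, np.ndarray):
        # Dense array: keep trailing dimensions and dtype, drop all rows.
        shape = (0,) + output_meta.shape[1:]
        return np.empty(shape, dtype=output_meta.dtype)
    # Unknown type: let the caller fall back to calling the estimator.
    return None


labels = empty_like_meta(np.array([0, 1], dtype=np.int64))    # 1-D, predict-style
probs = empty_like_meta(np.zeros((2, 3), dtype=np.float32))   # 2-D, predict_proba-style
```

Deriving the empty result from an example of the output, rather than from the input partition, keeps the number of columns and the dtype consistent with the non-empty partitions.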
This also matches what cuML does (it returns an empty series). See below:
>>> type(reg)
<class 'cuml.linear_model.logistic_regression.LogisticRegression'>
>>> reg.predict(X_new.iloc[:0])
Series([], dtype: float32)