[WIP] Fix empty partition prediction with ParallelPostFit

Open VibhuJawa opened this issue 3 years ago • 5 comments

This PR fixes https://github.com/dask/dask-ml/issues/911

VibhuJawa avatar Mar 25 '22 02:03 VibhuJawa

Thanks @VibhuJawa. When looking at the traceback in #911, I see that sklearn's check_array takes an ensure_min_samples parameter. If we pass ensure_min_samples=False there, does stuff go through properly?
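For context, the behaviour in question can be reproduced directly against scikit-learn's public `check_array` function. This is only a minimal sketch: `passes_validation` is a hypothetical helper written for illustration, and it calls `check_array` directly, which is not how estimators invoke validation internally.

```python
import numpy as np
from sklearn.utils import check_array

# An "empty partition": zero samples, three features.
X_empty = np.empty((0, 3))


def passes_validation(X, **kwargs):
    """Hypothetical helper: True if check_array accepts X, False if it raises."""
    try:
        check_array(X, **kwargs)
    except ValueError:
        return False
    return True


# Default is ensure_min_samples=1, so zero rows are rejected.
default_ok = passes_validation(X_empty)

# Setting the minimum to 0 disables the check (False behaves like 0 here).
relaxed_ok = passes_validation(X_empty, ensure_min_samples=0)

print(default_ok, relaxed_ok)
```

The check itself is easy to relax when `check_array` is called directly; whether that keyword can be reached from outside an estimator's `predict` is a separate question.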

TomAugspurger avatar Mar 25 '22 14:03 TomAugspurger

Thanks for reviewing the issue, Tom.

When looking at the traceback in https://github.com/dask/dask-ml/issues/911, I see that sklearn's check_array takes an ensure_min_samples parameter. If we pass ensure_min_samples=False there, does stuff go through properly?

So I tried exploring a clean way to expose the self._validate_data parameter but came up with nothing. I'm open to any ideas on that front.

I think the problem is that each family of models in sklearn calls it with different parameters (see below), and I am also not sure that all models will just work even if we can somehow coerce it to accept them (see related discussion here).

  1. _base_chain

  2. naive_bayes

  3. kmeans

  4. _kernel_pca

Please let me know if the approach I am taking in this PR is not feasible. I will try to explore other ways to go about solving this problem.

VibhuJawa avatar Mar 28 '22 18:03 VibhuJawa

@TomAugspurger - Any further feedback or guidance here? I don't see a way to expose check_array's kwargs without changes to sklearn. IMO, it seems reasonable to handle this case in dask-ml.

cc @jrbourbeau @betatim for vis

mmccarty avatar Sep 20 '22 19:09 mmccarty

I think what is happening is that predict() (and co) are being called with an empty input that contains zero samples. It seems sensible for the scikit-learn estimators to consider that an error. So I agree with Mike that this is something that dask-ml should handle, probably by not calling predict(), transform() etc. when a partition is empty and instead returning whatever is the expected result of making a prediction on an empty array (I'm not sure what this should be: None, or also an empty array?).
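A minimal sketch of that guard, assuming NumPy partitions. Both `MeanRegressor` (a toy stand-in for a fitted estimator that rejects empty input) and `predict_partition` are hypothetical names used only for illustration; neither is part of dask-ml or scikit-learn.

```python
import numpy as np


class MeanRegressor:
    """Toy stand-in for a fitted estimator (hypothetical)."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        # Mimic estimators that refuse input with zero samples.
        if X.shape[0] == 0:
            raise ValueError("Found array with 0 sample(s)")
        return np.full(X.shape[0], self.value, dtype=np.float64)


def predict_partition(estimator, part, output_dtype=np.float64):
    """Hypothetical per-partition wrapper: skip the estimator on empty input."""
    if part.shape[0] == 0:
        # Return an empty result instead of calling predict().
        return np.empty((0,), dtype=output_dtype)
    return estimator.predict(part)


est = MeanRegressor(2.5)
full = predict_partition(est, np.ones((4, 3)))
empty = predict_partition(est, np.empty((0, 3)))
print(full.shape, empty.shape)
```

Returning a zero-length array rather than None keeps the per-partition outputs concatenable, which is one reason to prefer it here.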

betatim avatar Sep 21 '22 08:09 betatim

instead returning whatever is the expected result of making a prediction on an empty array (I'm not sure what this should be: None, or also an empty array?).

In this PR,

  1. If the output is supposed to be an array, I return an empty array (for both sparse and dense arrays)

  2. If the output is supposed to be a dataframe (or a dataframe-like object), I return an empty dataframe-like object

https://github.com/dask/dask-ml/blob/28b97e0e903e09b3af4ee3c1dbbeadc7b0b57915/dask_ml/wrappers.py#L661-L688

This also matches what cuML does (which returns an empty series). See below:

>>> type(reg)
<class 'cuml.linear_model.logistic_regression.LogisticRegression'>
>>> reg.predict(X_new.iloc[:0])
Series([], dtype: float32)
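The two cases above can be sketched with NumPy and pandas as follows. This is not the code in the linked `wrappers.py` lines: `empty_result` is a hypothetical helper, it assumes an example of the expected output (`meta`) is available, and it leaves out the sparse-array case.

```python
import numpy as np
import pandas as pd


def empty_result(meta):
    """Hypothetical helper: zero-row object matching the type of `meta`."""
    if isinstance(meta, (pd.DataFrame, pd.Series)):
        # Keeps columns, name and dtypes, drops all rows.
        return meta.iloc[:0]
    # Dense array: keep dtype and trailing dimensions, drop all rows.
    return np.asarray(meta)[:0]


arr_out = empty_result(np.zeros((5, 2), dtype=np.float32))
ser_out = empty_result(pd.Series([0.1, 0.9], dtype="float32"))
df_out = empty_result(pd.DataFrame({"p0": [0.2], "p1": [0.8]}))

print(arr_out.shape, len(ser_out), list(df_out.columns))
```

Slicing an example output down to zero rows preserves dtype and column information, so the empty pieces stay compatible with the non-empty partitions.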

VibhuJawa avatar Sep 21 '22 09:09 VibhuJawa