
Incremental wrapper fails for IncrementalPCA

Open huberl opened this issue 3 years ago • 1 comment

What happened: When calling fit() on the Incremental wrapper with IncrementalPCA, the following error is raised: AttributeError: 'numpy.ndarray' object has no attribute 'chunks'. It seems the Dask Array is internally converted to a NumPy array, which is wrong. I also looked at the scoring parameter, but it is not applicable to PCA and should not cause any issues during fit.

What you expected to happen: The Incremental wrapper should not internally convert the Dask Array to a NumPy array.

Minimal Complete Verifiable Example:

from dask_ml.datasets import make_classification
from dask_ml.decomposition import IncrementalPCA
from dask_ml.wrappers import Incremental

X, _ = make_classification(n_samples=100000, n_features=100, chunks=10000)
pca = IncrementalPCA(n_components=8, batch_size=40000)
inc = Incremental(pca)
inc.partial_fit(X)    # AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
pca.partial_fit(X)    # This works

Environment:

  • Dask version: 2022.1
  • Python version: 3.9
  • Operating System: Ubuntu
  • Install method (conda, pip, source): Conda

huberl avatar Apr 13 '22 10:04 huberl

I'm curious: why are you using Incremental and IncrementalPCA together? I think that decomposition.IncrementalPCA expects a Dask Array. But Incremental feeds through chunks of Dask Arrays to the underlying estimator.
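To illustrate the mismatch: as a rough sketch (my assumption of what the wrapper does, not taken from the dask-ml source), Incremental calls the wrapped estimator's partial_fit once per chunk, passing a plain NumPy array. That is why scikit-learn's IncrementalPCA, which accepts NumPy batches, works inside the wrapper, while dask_ml's IncrementalPCA, which expects a whole Dask Array with a .chunks attribute, does not:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))

pca = IncrementalPCA(n_components=8)
# Feed the estimator one NumPy chunk at a time, the way Incremental
# (roughly) streams chunks of a Dask Array to the underlying estimator.
for chunk in np.array_split(X, 10):
    pca.partial_fit(chunk)   # each chunk is a plain ndarray, no .chunks

print(pca.components_.shape)  # (8, 20)
```

So combining the two is redundant: either let dask_ml's IncrementalPCA consume the Dask Array directly, or wrap scikit-learn's IncrementalPCA in Incremental.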

TomAugspurger avatar Apr 16 '22 13:04 TomAugspurger