dask-ml icon indicating copy to clipboard operation
dask-ml copied to clipboard

Make automatic centering in PCA methods optional

Open hristog opened this issue 4 years ago • 6 comments
trafficstars

  • Closes: https://github.com/dask/dask-ml/issues/734.
  • Tests/black/pyflakes passing: Yes.
  • Tests added to the following files:**
    • tests/test_pca.py
    • tests/test_incremental_pca.py
  • tl;dr of design choices made:
    • IncrementalPCA doesn't get affected by the new parameter and IncrementalPCA(..., center=False ,...) is unsupported.
    • whiten=True does not imply, nor enforce center=True. Instead it warns about possible unexpected behavior, if the underlying data hasn't been centered already.

hristog avatar Mar 20 '21 12:03 hristog

Once the maintainers confirm approval of the proposed changes, I'll be happy to update the .rst docs accordingly, to document the updated API, in a separate commit (as part of this PR).

hristog avatar Mar 21 '21 19:03 hristog

From the latest CI runs (e.g., this job).

black, version 19.10b0
would reformat /home/vsts/work/1/s/tests/test_pca.py
Oh no! 💥 💔 💥
1 file would be reformatted, 99 files would be left unchanged.

Oops - pretty sure that I did this check locally. Will re-run black to confirm.

Update: This is strange. I'll check if my local branch has actually propagated, and is the same as what the Azure pipeline has picked up.

Result of ./ci/code_checks.sh run locally from the root dir of `dask-ml`:
Checking flake8...
Checking flake8... DONE
Checking black...
black, version 20.8b1
would reformat /git/dask-ml/dask_ml/model_selection/utils_test.py
would reformat /git/dask-ml/tests/ensemble/test_blockwise.py
would reformat /git/dask-ml/tests/model_selection/test_keras.py
would reformat /git/dask-ml/tests/model_selection/test_hyperband.py
would reformat /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection.py
would reformat /git/dask-ml/tests/model_selection/test_incremental.py
Oh no! 💥 💔 💥
6 files would be reformatted, 94 files would be left unchanged.
Checking black... DONE
Checking isort...
5.6.4
ERROR: /git/dask-ml/tests/test_spectral_clustering.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/test_incremental_pca.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/linear_model/test_glm.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection_sklearn.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/docs/dimensions.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/dask_ml/_compat.py Imports are incorrectly sorted and/or formatted.
...

Update (2): Seems that it's the same code version, both locally and remotely.

Locally:

$ git show
commit 01dd9d4f71ee380e5b3e1a969f3bd46bebda6cd9 (HEAD -> 734-pca-skip-centering, origin/734-pca-skip-centering)
...
    Add param validation to PCA classes (#808)

Remotely: The most recent commit (and only one since Tom's review): https://github.com/dask/dask-ml/pull/808/commits/01dd9d4f71ee380e5b3e1a969f3bd46bebda6cd9.

I'll try reproducing black's results via the same version as per the CI pipeline (black, version 19.10b0).

Update (3): Victory! It was indeed due to the different local version of black==black-20.8b1 vs remote (black==19.10b0, as per .pre-commit-config.yaml and ci/*.yaml).

hristog avatar Mar 24 '21 18:03 hristog

Hi @TomAugspurger - I'm writing to check if you've had a chance to look into the further changes that I pushed since your review, and whether you'd like any additional refinements to be done on top of that. Thanks!

hristog avatar Apr 08 '21 17:04 hristog

It'll be a week or so before I can take a look. Thanks for your patience!

In the meantime, perhaps @jameslamb has a chance to glance through this (no worries if you don't though)?

TomAugspurger avatar Apr 12 '21 11:04 TomAugspurger

It'll be a week or so before I can take a look. Thanks for your patience!

It's all good - just wanted to check if this is still on your radar. I'm happy to wait as long as necessary, and incorporate potential further feedback. Thanks for your confirmation, @TomAugspurger!

hristog avatar Apr 12 '21 12:04 hristog

I had some time to go through and review today. Overall, I'm in favor of the change and I can understand how it could be a nice optimization for experienced users.

Please consider some of the review comments I left.

I also think it would improve understanding for people who find this pull request from search or see it in a changelog if you change the PR title to something like "Make automatic centering in PCA methods optional". By itself, the fact that there is now a parameter called center doesn't tell you much about the functional benefit of the changes in this pull request.

Thanks for your review, @jameslamb! I'll try to address your comments and make changes, wherever applicable, ASAP.

hristog avatar Apr 12 '21 21:04 hristog