dask-ml
dask-ml copied to clipboard
Make automatic centering in PCA methods optional
- Closes: https://github.com/dask/dask-ml/issues/734.
- Tests/
black/pyflakespassing: Yes. - Tests added to the following files:**
tests/test_pca.pytests/test_incremental_pca.py
- tl;dr of design choices made:
IncrementalPCAdoesn't get affected by the new parameter andIncrementalPCA(..., center=False ,...)is unsupported.whiten=Truedoes not imply, nor enforcecenter=True. Instead it warns about possible unexpected behavior, if the underlying data hasn't been centered already.
Once the maintainers confirm approval of the proposed changes, I'll be happy to update the .rst docs accordingly, to document the updated API, in a separate commit (as part of this PR).
From the latest CI runs (e.g., this job).
black, version 19.10b0
would reformat /home/vsts/work/1/s/tests/test_pca.py
Oh no! 💥 💔 💥
1 file would be reformatted, 99 files would be left unchanged.
Oops - pretty sure that I did this check locally. Will re-run black to confirm.
Update: This is strange. I'll check if my local branch has actually propagated, and is the same as what the Azure pipeline has picked up.
Result of ./ci/code_checks.sh run locally from the root dir of `dask-ml`:
Checking flake8...
Checking flake8... DONE
Checking black...
black, version 20.8b1
would reformat /git/dask-ml/dask_ml/model_selection/utils_test.py
would reformat /git/dask-ml/tests/ensemble/test_blockwise.py
would reformat /git/dask-ml/tests/model_selection/test_keras.py
would reformat /git/dask-ml/tests/model_selection/test_hyperband.py
would reformat /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection.py
would reformat /git/dask-ml/tests/model_selection/test_incremental.py
Oh no! 💥 💔 💥
6 files would be reformatted, 94 files would be left unchanged.
Checking black... DONE
Checking isort...
5.6.4
ERROR: /git/dask-ml/tests/test_spectral_clustering.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/test_incremental_pca.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/linear_model/test_glm.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/tests/model_selection/dask_searchcv/test_model_selection_sklearn.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/docs/dimensions.py Imports are incorrectly sorted and/or formatted.
ERROR: /git/dask-ml/dask_ml/_compat.py Imports are incorrectly sorted and/or formatted.
...
Update (2): Seems that it's the same code version, both locally and remotely.
Locally:
$ git show
commit 01dd9d4f71ee380e5b3e1a969f3bd46bebda6cd9 (HEAD -> 734-pca-skip-centering, origin/734-pca-skip-centering)
...
Add param validation to PCA classes (#808)
Remotely: The most recent commit (and only one since Tom's review): https://github.com/dask/dask-ml/pull/808/commits/01dd9d4f71ee380e5b3e1a969f3bd46bebda6cd9.
I'll try reproducing black's results via the same version as per the CI pipeline (black, version 19.10b0).
Update (3): Victory! It was indeed due to the different local version of black==black-20.8b1 vs remote (black==19.10b0, as per .pre-commit-config.yaml and ci/*.yaml).
Hi @TomAugspurger - I'm writing to check if you've had a chance to look into the further changes that I pushed since your review, and whether you'd like any additional refinements to be done on top of that. Thanks!
It'll be a week or so before I can take a look. Thanks for your patience!
In the meantime, perhaps @jameslamb has a chance to glance through this (no worries if you don't though)?
It'll be a week or so before I can take a look. Thanks for your patience!
It's all good - just wanted to check if this is still on your radar. I'm happy to wait as long as necessary, and incorporate potential further feedback. Thanks for your confirmation, @TomAugspurger!
I had some time to go through and review today. Overall, I'm in favor of the change and I can understand how it could be a nice optimization for experienced users.
Please consider some of the review comments I left.
I also think it would improve understanding for people who find this pull request from search or see it in a changelog if you change the PR title to something like "Make automatic centering in PCA methods optional". By itself, the fact that there is now a parameter called
centerdoesn't tell you much about the functional benefit of the changes in this pull request.
Thanks for your review, @jameslamb! I'll try to address your comments and make changes, wherever applicable, ASAP.