[WIP] Explained Variance
This PR fixes #316 and extends #954 to get it wrapped up.
I have made the following changes:
- Added documentation and tests
- Refactored the quick method according to the quick method sprint
Sample Code and Plot
If you are adding or modifying a visualizer, PLEASE include a sample plot here along with the code you used to generate it.
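The visualizer itself is still WIP, so as a placeholder here is a minimal sketch of the kind of plot this PR is aiming for, built directly with scikit-learn's `PCA` and matplotlib (no yellowbrick API is assumed, and `load_digits` is just example data):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Fit PCA with all components so the full variance spectrum is available
X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)

# Per-component explained variance ratio and its cumulative sum
ratios = pca.explained_variance_ratio_
cumulative = ratios.cumsum()

fig, ax = plt.subplots()
ax.bar(range(1, len(ratios) + 1), ratios, label="explained variance")
ax.step(range(1, len(cumulative) + 1), cumulative, where="mid",
        label="cumulative explained variance")
ax.set_xlabel("number of components")
ax.set_ylabel("explained variance ratio")
ax.legend(loc="center right")
plt.show()
```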
TODOs and questions
Still to do:
- [x] Integrate #954
- [ ] Update visualizer
- [ ] Write documentation
- [ ] Add better tests
CHECKLIST
- [x] Is the commit message formatted correctly?
- [x] Have you noted the new functionality/bugfix in the release notes of the next release?
- [ ] Included a sample plot to visually illustrate your changes?
- [x] Do all of your functions and methods have docstrings?
- [ ] Have you added/updated unit tests where appropriate?
- [ ] Have you updated the baseline images if necessary?
- [x] Have you run the unit tests using `pytest`?
- [x] Is your code style correct (are you using PEP8, pyflakes)?
- [x] Have you documented your new feature/functionality in the docs?
- [x] Have you built the docs using `make html`?
Resources:
- https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
- https://towardsdatascience.com/an-approach-to-choosing-the-number-of-components-in-a-principal-component-analysis-pca-3b9f3d6e73fe
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/fill_between_demo.html
@naresh-bachwani prototype version:

Hey there @bbengfort, the above plot looks quite good! A few features that we could add:
- A `cutoff` parameter that takes in a threshold value and returns the number of components corresponding to it (a quick sketch of this follows below). We could also have the reverse scenario, where an `n_features` parameter returns the explained variance for that number of features, although I am not sure we have a use case for this.
- Also, in the above example we do not get the exact number of features (it is somewhere between 20 and 30). It would be better if we could provide ticks corresponding to the cutoff line.
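To make the `cutoff` idea concrete, here is a rough sketch of how a threshold on cumulative explained variance could be mapped back to a component count; `cutoff` is only a proposed parameter name, not an existing yellowbrick argument:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

def components_for_cutoff(X, cutoff=0.85):
    """Return the smallest number of components whose cumulative
    explained variance ratio reaches the given cutoff."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted finds the first index where cumulative >= cutoff;
    # +1 converts the zero-based index into a component count
    return int(np.searchsorted(cumulative, cutoff)) + 1

X, _ = load_digits(return_X_y=True)
print(components_for_cutoff(X, cutoff=0.85))
```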
Question: Is 85% variance some kind of magic number, similar to p = 0.05 or the "rule of thumb" correlation coefficients and effect sizes? Below is an example of some of the expected magic numbers in "rule of thumb" tables:
| Strength | Correlation | Cohen's d | Angular | p-value | Odds Ratio |
|---|---|---|---|---|---|
| Very Strong | > 0.8 | > 2 | < 15° | < 0.001 | > 10 |
| Strong | > 0.6 | > 1.2 | < 30° | < 0.005 | > 3 |
| Moderate | > 0.4 | > 0.8 | < 45° | < 0.01 | > 1.5 |
| Weak | > 0.2 | > 0.5 | < 60° | < 0.05 | > 1.2 |
| Very Weak | > 0.1 | > 0.2 | < 75° | < 0.10 | > 1.0 |
| Negligible | > 0 | > 0 | = 90° | < 1 | = 1 |
References: https://archive.fo/Bi6yF https://archive.fo/ourlZ https://archive.fo/hVKdt https://archive.fo/jQOX8 https://archive.fo/Xh9xk
@BrandonKMLee - yes, exactly right; it's just a rule of thumb about how many components are needed to have a "very strong" correlation or "enough explained variance", similar to the table you provided or a p-value of 0.05.
@bbengfort Thanks for the issue mention.
Looking at the above plot (which is pretty good), I would mention a few things.
- I'm not convinced about plotting both the explained variance and the cumulative variance. In my experience I have only ever plotted cumulative explained variance, as it's generally easier to visually gauge the rate at which you approach 1: a steep curve is good, a flat curve is bad. It also opens up things like area under the curve.
- Do we want this as a percentage (0-100) instead of (0-1)? People like percentages; failing that, we should make it clear that the value is an explained variance ratio/proportion/relative EV and not just "explained variance". These values have been normalized to 1 and are not just the scaled singular values.
- I'm happy with a default of 85% for cumulative variance, assuming this parameter can be changed (a rough sketch of this combination follows below).
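Putting those suggestions together, here is a rough sketch (again bypassing the WIP visualizer) of a cumulative-only, percentage-scaled plot with an adjustable 85% cutoff line, using `fill_between` as in the matplotlib demo linked above:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)

# Cumulative explained variance on a percentage (0-100) scale
components = np.arange(1, len(pca.explained_variance_ratio_) + 1)
cumulative = np.cumsum(pca.explained_variance_ratio_) * 100
cutoff = 85  # default threshold, intended to be user-configurable

fig, ax = plt.subplots()
ax.plot(components, cumulative)
ax.fill_between(components, cumulative, alpha=0.3)  # shade the area under the curve
ax.axhline(cutoff, color="r", linestyle="--", label=f"{cutoff}% cutoff")

# Mark the first component count that reaches the cutoff
n_components = int(np.searchsorted(cumulative, cutoff)) + 1
ax.axvline(n_components, color="r", linestyle=":",
           label=f"{n_components} components")

ax.set_xlim(1, components[-1])
ax.set_ylim(0, 100)
ax.set_xlabel("number of components")
ax.set_ylabel("cumulative explained variance (%)")
ax.legend(loc="lower right")
plt.show()
```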
Some alternative ideas (both sketched briefly after this list):
- Visuals for K1 or Kaiser criterion https://en.wikipedia.org/wiki/Exploratory_factor_analysis#Kaiser_criterion
- The "broken stick" method https://blogs.sas.com/content/iml/2017/08/02/retain-principal-components.html