[WIP] Explained Variance
This PR fixes #316 and extends #954 to get it wrapped up.
I have made the following changes:
- Added documentation and tests
- Refactored the quick method according to the quick method sprint
Sample Code and Plot
If you are adding or modifying a visualizer, PLEASE include a sample plot here along with the code you used to generate it.
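The visualizer itself is still WIP, so as a placeholder here is a minimal sketch of the kind of plot this PR is aiming for, built directly with scikit-learn's `PCA` and matplotlib (no yellowbrick API is assumed, and `load_digits` is just example data):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Fit PCA with all components so the full variance spectrum is available
X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)

# Per-component explained variance ratio and its cumulative sum
ratios = pca.explained_variance_ratio_
cumulative = ratios.cumsum()

fig, ax = plt.subplots()
ax.bar(range(1, len(ratios) + 1), ratios, label="explained variance")
ax.step(range(1, len(cumulative) + 1), cumulative, where="mid",
        label="cumulative explained variance")
ax.set_xlabel("number of components")
ax.set_ylabel("explained variance ratio")
ax.legend(loc="center right")
plt.show()
```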
TODOs and questions
Still to do:
- [x] Integrate #954
- [ ] Update visualizer
- [ ] Write documentation
- [ ] Add better tests
CHECKLIST
- [x] Is the commit message formatted correctly?
- [x] Have you noted the new functionality/bugfix in the release notes of the next release?
- [ ] Included a sample plot to visually illustrate your changes?
- [x] Do all of your functions and methods have docstrings?
- [ ] Have you added/updated unit tests where appropriate?
- [ ] Have you updated the baseline images if necessary?
- [x] Have you run the unit tests using `pytest`?
- [x] Is your code style correct (are you using PEP8, pyflakes)?
- [x] Have you documented your new feature/functionality in the docs?
- [x] Have you built the docs using `make html`?
Resources:
- https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html
- https://towardsdatascience.com/an-approach-to-choosing-the-number-of-components-in-a-principal-component-analysis-pca-3b9f3d6e73fe
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
- https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/fill_between_demo.html
@naresh-bachwani prototype version:

Hey there @bbengfort, the above plot looks quite good! A few features that we could add:
- A `cutoff` parameter that takes in a threshold value and returns the number of components corresponding to it (a quick sketch of this follows below). We could also have the reverse scenario, where an `n_features` parameter returns the explained variance for that number of features, although I am not sure we have a use case for this.
- Also, in the above example we do not get the exact number of features (it is somewhere between 20 and 30). It would be better if we could provide ticks corresponding to the cutoff line.
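To make the `cutoff` idea concrete, here is a rough sketch of how a threshold on cumulative explained variance could be mapped back to a component count; `cutoff` is only a proposed parameter name, not an existing yellowbrick argument:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

def components_for_cutoff(X, cutoff=0.85):
    """Return the smallest number of components whose cumulative
    explained variance ratio reaches the given cutoff."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted finds the first index where cumulative >= cutoff;
    # +1 converts the zero-based index into a component count
    return int(np.searchsorted(cumulative, cutoff)) + 1

X, _ = load_digits(return_X_y=True)
print(components_for_cutoff(X, cutoff=0.85))
```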
Question: Is 85% variance some kind of magic number, similar to p = 0.05 or the "rule of thumb" correlation coefficients and effect sizes? Below is an example of some of the expected magic numbers in "rule of thumb" tables:
| Strength | Correlation | Cohen's d | Angular | p-value | Odds Ratio |
|---|---|---|---|---|---|
| Very Strong | > 0.8 | > 2 | < 15° | < 0.001 | > 10 |
| Strong | > 0.6 | > 1.2 | < 30° | < 0.005 | > 3 |
| Moderate | > 0.4 | > 0.8 | < 45° | < 0.01 | > 1.5 |
| Weak | > 0.2 | > 0.5 | < 60° | < 0.05 | > 1.2 |
| Very Weak | > 0.1 | > 0.2 | < 75° | < 0.10 | > 1.0 |
| Negligible | > 0 | > 0 | = 90° | < 1 | = 1 |
References: https://archive.fo/Bi6yF https://archive.fo/ourlZ https://archive.fo/hVKdt https://archive.fo/jQOX8 https://archive.fo/Xh9xk
@BrandonKMLee - yes, exactly right; it's just a rule of thumb about how many components are needed to have a "very strong" correlation or "enough explained variance", similar to the table you provided or a p-value of 0.05.
@bbengfort Thanks for the issue mention.
Looking at the above plot (which is pretty good), I would mention a few things.
- I'm not convinced about plotting both the explained variance and the cumulative variance. In my experience I have only ever plotted cumulative explained variance, as it's generally easier to visually gauge the rate at which you approach 1: a steep curve is good, a flat curve is bad. It also opens up things like area under the curve.
- Do we want this as a percentage (0-100) instead of (0-1)? People like percentages; failing that, we should make it clear that the value is an explained variance ratio/proportion/relative EV and not just "explained variance". These values have been normalized to 1 and are not just the scaled singular values.
- I'm happy with a default of 85% for cumulative variance, assuming this parameter can be changed (a rough sketch of this combination follows below).
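Putting those suggestions together, here is a rough sketch (again bypassing the WIP visualizer) of a cumulative-only, percentage-scaled plot with an adjustable 85% cutoff line, using `fill_between` as in the matplotlib demo linked above:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)

# Cumulative explained variance on a percentage (0-100) scale
components = np.arange(1, len(pca.explained_variance_ratio_) + 1)
cumulative = np.cumsum(pca.explained_variance_ratio_) * 100
cutoff = 85  # default threshold, intended to be user-configurable

fig, ax = plt.subplots()
ax.plot(components, cumulative)
ax.fill_between(components, cumulative, alpha=0.3)  # shade the area under the curve
ax.axhline(cutoff, color="r", linestyle="--", label=f"{cutoff}% cutoff")

# Mark the first component count that reaches the cutoff
n_components = int(np.searchsorted(cumulative, cutoff)) + 1
ax.axvline(n_components, color="r", linestyle=":",
           label=f"{n_components} components")

ax.set_xlim(1, components[-1])
ax.set_ylim(0, 100)
ax.set_xlabel("number of components")
ax.set_ylabel("cumulative explained variance (%)")
ax.legend(loc="lower right")
plt.show()
```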
Some alternative ideas (both sketched briefly after this list):
- Visuals for K1 or Kaiser criterion https://en.wikipedia.org/wiki/Exploratory_factor_analysis#Kaiser_criterion
- The "broken stick" method https://blogs.sas.com/content/iml/2017/08/02/retain-principal-components.html