[ENH] New Machine Learning Features
Brief Description
I've been thinking of a few ML additions for pyjanitor. Below are 3 ideas that I think would be helpful to add to the API:
- A method to drop features based on a variance threshold (think sklearn `VarianceThreshold`)
- A method to cap outliers that fall below or above a certain quantile (if below 0.05, replace with the 0.05 quantile or a specified value; if above 0.95, replace with the 0.95 quantile or a specified value)
- Simple `standardize` and `scale` methods (think sklearn `StandardScaler` and `MinMaxScaler`)
I realize that (1) and (3) above have sklearn implementations, but I think it might be nice to have simplified versions that we can reach for when doing some quick analysis/data prep, without having to load sklearn and fit an object. This would fit nicely with pyjanitor's chaining approach. Notes in the docstrings can point users to sklearn for more robust solutions (and for cases where they want to retain the fitted parameters to apply to other data).
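To make the chaining point concrete, here is a rough sketch of how these could be wired up as chainable dataframe methods via `pandas_flavor` (the same registration mechanism pyjanitor uses under the hood). The method names, signatures, and defaults below are placeholders for discussion, not existing pyjanitor API:

import pandas as pd
import pandas_flavor as pf


@pf.register_dataframe_method
def drop_low_variance_features(df, threshold=0.0):
    # Keep only the columns whose variance exceeds the threshold.
    return df.loc[:, df.var() > threshold]


@pf.register_dataframe_method
def cap_outliers(df, lower=0.05, upper=0.95):
    # Replace values outside the given quantiles with the quantile values.
    low, high = df.quantile(lower), df.quantile(upper)
    return df.mask(df < low, low, axis=1).mask(df > high, high, axis=1)


@pf.register_dataframe_method
def standardize(df, ddof=0):
    # Center each column to zero mean and scale to unit variance.
    return (df - df.mean()) / df.std(ddof=ddof)


df = pd.DataFrame({"A": [2, 4, 8, 16, 32], "B": [2, 2, 2, 2, 2]})

# The methods then compose in a single chain:
cleaned = df.drop_low_variance_features().cap_outliers().standardize()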
Example API
# Drop features based on variance threshold
>>> df = pd.DataFrame({
... 'A': [1, 2, 3, 4, 5],
... 'B': [2, 2, 2, 2, 2],
... 'C': [1, 1, 1, 1, 2]
... })
>>> df
A B C
0 1 2 1
1 2 2 1
2 3 2 1
3 4 2 1
4 5 2 2
>>> df.var()
A 2.5
B 0.0
C 0.2
dtype: float64
>>> threshold = 0.0 # default
>>> keep_features = df.var() > threshold
>>> df.loc[:, keep_features]
A C
0 1 1
1 2 1
2 3 1
3 4 1
4 5 2
>>> threshold = 0.5 # custom
>>> keep_features = df.var() > threshold
>>> df.loc[:, keep_features]
A
0 1
1 2
2 3
3 4
4 5
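As a quick sniff test against sklearn (mirroring the checks shown for standardize/scale below): note that `VarianceThreshold` uses the population variance (ddof=0), while `DataFrame.var()` defaults to the sample variance (ddof=1), although both conventions keep the same columns for this toy frame.
>>> from sklearn.feature_selection import VarianceThreshold
>>> VarianceThreshold(threshold=0.0).fit(df).get_support()
array([ True, False,  True])
>>> VarianceThreshold(threshold=0.5).fit(df).get_support()
array([ True, False, False])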
# Cap outliers based on a certain quantile
>>> df = pd.DataFrame({
... 'A': [2, 4, 8, 16, 32],
... 'B': [-999, 2, 4, 6, 999]
... })
>>> df
A B
0 2 -999
1 4 2
2 8 4
3 16 6
4 32 999
>>> df.quantile(.05)
A 2.4
B -798.8
Name: 0.05, dtype: float64
>>> df.mask(df < df.quantile(.05), df.quantile(.05), axis=1)
A B
0 2.4 -798.8
1 4.0 2.0
2 8.0 4.0
3 16.0 6.0
4 32.0 999.0
>>> df.quantile(.95)
A 28.8
B 800.4
Name: 0.95, dtype: float64
>>> df.mask(df > df.quantile(.95), df.quantile(.95), axis=1)
A B
0 2.0 -999.0
1 4.0 2.0
2 8.0 4.0
3 16.0 6.0
4 28.8 800.4
>>> df.mask(df > df.quantile(.95), df.quantile(.95), axis=1).mask(df < df.quantile(.05), df.quantile(.05), axis=1)
A B
0 2.4 -798.8
1 4.0 2.0
2 8.0 4.0
3 16.0 6.0
4 28.8 800.4
>>> df.mask(df > df.quantile(.95), 42, axis=1).mask(df < df.quantile(.05), -1, axis=1)
A B
0 -1 -1
1 4 2
2 8 4
3 16 6
4 42 42
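For what it's worth, the two chained `mask` calls can also be collapsed into a single `DataFrame.clip` call with the quantiles as bounds, which might be a natural building block for the proposed method:
>>> df.clip(lower=df.quantile(.05), upper=df.quantile(.95), axis=1)
A B
0 2.4 -798.8
1 4.0 2.0
2 8.0 4.0
3 16.0 6.0
4 28.8 800.4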
# Standardize
>>> df = pd.DataFrame({
... 'A': [2, 4, 8, 16, 32],
... 'B': [1, 2, 3, 4, 5],
... 'C': [-15, -10, 0, 10, 15],
... })
>>> df
A B C
0 2 1 -15
1 4 2 -10
2 8 3 0
3 16 4 10
4 32 5 15
>>> (df - df.mean())/df.std(ddof=0)
A B C
0 -0.953206 -1.414214 -1.315587
1 -0.769897 -0.707107 -0.877058
2 -0.403280 0.000000 0.000000
3 0.329956 0.707107 0.877058
4 1.796427 1.414214 1.315587
# Just as a sniff test (ddof could be an argument to the function)
# https://drorata.github.io/posts/2017/Sep/10/why-do-we-need-to-divide-by-n-1/index.html#Summary
>>> from sklearn.preprocessing import StandardScaler
>>> standardizer = StandardScaler()
>>> standardizer.fit_transform(df)
array([[-0.95320625, -1.41421356, -1.31558703],
[-0.76989735, -0.70710678, -0.87705802],
[-0.40327957, 0. , 0. ],
[ 0.32995601, 0.70710678, 0.87705802],
[ 1.79642716, 1.41421356, 1.31558703]])
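One detail worth flagging for a `standardize` method: `DataFrame.std()` defaults to `ddof=1` (sample standard deviation), while `StandardScaler` divides by the population standard deviation (`ddof=0`), so exposing `ddof` as a parameter, as noted above, would let users match either convention. For the frame above:
>>> df.std()  # default ddof=1
A 12.198361
B 1.581139
C 12.747549
dtype: float64
>>> df.std(ddof=0)  # what StandardScaler uses
A 10.910545
B 1.414214
C 11.401754
dtype: float64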
# Scale Method
>>> df
A B C
0 2 1 -15
1 4 2 -10
2 8 3 0
3 16 4 10
4 32 5 15
>>> (df - df.min()) / (df.max() - df.min())
A B C
0 0.000000 0.00 0.000000
1 0.066667 0.25 0.166667
2 0.200000 0.50 0.500000
3 0.466667 0.75 0.833333
4 1.000000 1.00 1.000000
>>> (df - df.min()) / (df.max() - df.min()) * (1 - -1) + -1
A B C
0 -1.000000 -1.0 -1.000000
1 -0.866667 -0.5 -0.666667
2 -0.600000 0.0 0.000000
3 -0.066667 0.5 0.666667
4 1.000000 1.0 1.000000
# Just as a sniff test
>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> scaler.fit_transform(df)
array([[0. , 0. , 0. ],
[0.06666667, 0.25 , 0.16666667],
[0.2 , 0.5 , 0.5 ],
[0.46666667, 0.75 , 0.83333333],
[1. , 1. , 1. ]])
>>> scaler = MinMaxScaler(feature_range=(-1,1))
>>> scaler.fit_transform(df)
array([[-1. , -1. , -1. ],
[-0.86666667, -0.5 , -0.66666667],
[-0.6 , 0. , 0. ],
[-0.06666667, 0.5 , 0.66666667],
[ 1. , 1. , 1. ]])
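A thin wrapper over the formula above, with a `feature_range` argument mirroring `MinMaxScaler`, could look something like this (the name and signature are only illustrative):

def scale(df, feature_range=(0, 1)):
    # Rescale each column linearly so its min/max land on the ends of feature_range.
    new_min, new_max = feature_range
    scaled = (df - df.min()) / (df.max() - df.min())
    return scaled * (new_max - new_min) + new_min

scale(df, feature_range=(-1, 1))  # reproduces the (-1, 1) output above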
@ericmjl curious to hear your thoughts before starting to develop any/all of these ideas. Thanks!
FYI, we have a `min_max_scale` already: https://pyjanitor-devs.github.io/pyjanitor/reference/janitor.functions/janitor.min_max_scale.html#janitor-min-max-scale
These are great functions to have as a convenience, @loganthomas! I'm in favour of having these in the library. The docstrings, however, should come with a strong warning: in proper machine learning practice, any transformations should still be part of an sklearn pipeline so that we don't have info leakage from the training set into the test set. (How that happens goes deep; we can definitely talk about this in more detail.)
Super glad you brought this one up, Logan!
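To illustrate the info-leakage point above with a minimal sketch (dataset and model are arbitrary, just for illustration): inside a Pipeline, the scaler's statistics come only from the training split and are merely re-applied to the test split, whereas scaling the full dataframe up front would let the test set influence those statistics.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler's mean/std are learned from X_train only during fit()...
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# ...and re-applied (not re-fit) when scoring on the held-out data,
# so no test-set information leaks into the preprocessing step.
model.score(X_test, y_test)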