
[ENH] New Machine Learning Features

Open loganthomas opened this issue 3 years ago • 2 comments

Brief Description

I've been thinking of a few ML additions for pyjanitor. Below are 3 ideas that I think would be helpful to add to the API:

  1. A method to drop features based on a variance threshold (think sklearn VarianceThreshold)
  2. A method to cap outliers that fall below or above a certain quantile (e.g., values below the 0.05 quantile are replaced with the 0.05 quantile value or a user-specified value, and values above the 0.95 quantile are replaced with the 0.95 quantile value or a user-specified value)
  3. Simple standardize and scale methods (think sklearn StandardScaler and MinMaxScaler)

I realize that (1) and (3) above have sklearn implementations, but I think it would be nice to have simplified versions we can reach for when doing some quick analysis/data prep, without having to load sklearn and fit an object. This would also suit pyjanitor's chaining approach. Notes in the docstrings can point users to sklearn for more robust solutions (and for cases where you want to retain the fitted parameters to apply to other data).

Example API

# Drop features based on variance threshold
>>> df = pd.DataFrame({
...     'A': [1, 2, 3, 4, 5],
...     'B': [2, 2, 2, 2, 2],
...     'C': [1, 1, 1, 1, 2]
... })
>>> df
   A  B  C
0  1  2  1
1  2  2  1
2  3  2  1
3  4  2  1
4  5  2  2
>>> df.var()
A    2.5
B    0.0
C    0.2
dtype: float64
>>> threshold = 0.0  # default
>>> keep_features = df.var() > threshold
>>> df.loc[:, keep_features]
   A  C
0  1  1
1  2  1
2  3  1
3  4  1
4  5  2
>>> threshold = 0.5  # custom
>>> keep_features = df.var() > threshold
>>> df.loc[:, keep_features]
   A
0  1
1  2
2  3
3  4
4  5
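
A convenience method for this could be quite thin. Here's a minimal sketch: the drop_low_variance_features name, its signature, and the pandas_flavor registration are illustrative placeholders, not existing API.

# Sketch only: a pandas-flavor-registered method wrapping the variance filter above.
import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def drop_low_variance_features(df: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Drop columns whose variance is not strictly greater than `threshold`."""
    keep_features = df.var() > threshold
    return df.loc[:, keep_features]

>>> df.drop_low_variance_features()               # drops the constant column B
>>> df.drop_low_variance_features(threshold=0.5)  # keeps only A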

# Cap outliers based on a certain quantile
>>> df = pd.DataFrame({
...     'A': [2, 4, 8, 16, 32],
...     'B': [-999, 2, 4, 6, 999]
... })
>>> df
    A    B
0   2 -999
1   4    2
2   8    4
3  16    6
4  32  999
>>> df.quantile(.05)
A      2.4
B   -798.8
Name: 0.05, dtype: float64
>>> df.mask(df < df.quantile(.05), df.quantile(.05), axis=1)
      A      B
0   2.4 -798.8
1   4.0    2.0
2   8.0    4.0
3  16.0    6.0
4  32.0  999.0
>>> df.quantile(.95)
A     28.8
B    800.4
Name: 0.95, dtype: float64
>>> df.mask(df > df.quantile(.95), df.quantile(.95), axis=1)
      A      B
0   2.0 -999.0
1   4.0    2.0
2   8.0    4.0
3  16.0    6.0
4  28.8  800.4
>>> df.mask(df > df.quantile(.95), df.quantile(.95), axis=1).mask(df < df.quantile(.05), df.quantile(.05), axis=1)
      A      B
0   2.4 -798.8
1   4.0    2.0
2   8.0    4.0
3  16.0    6.0
4  28.8  800.4
>>> df.mask(df > df.quantile(.95), 42, axis=1).mask(df < df.quantile(.05), -1, axis=1)
    A   B
0  -1  -1
1   4   2
2   8   4
3  16   6
4  42  42
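
Chaining the two masks could be wrapped up similarly. A minimal sketch follows; the cap_outliers name and arguments are placeholders. (For the quantile-only case, df.clip(lower=df.quantile(.05), upper=df.quantile(.95), axis=1) gives the same result as the double mask.)

# Sketch only: cap values below/above the given quantiles, or at user-specified values.
import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def cap_outliers(
    df: pd.DataFrame,
    lower_quantile: float = 0.05,
    upper_quantile: float = 0.95,
    lower_value=None,
    upper_value=None,
) -> pd.DataFrame:
    """Replace values outside the quantile bounds with the bound (or a supplied value)."""
    lower_q = df.quantile(lower_quantile)
    upper_q = df.quantile(upper_quantile)
    out = df.mask(df < lower_q, lower_q if lower_value is None else lower_value, axis=1)
    out = out.mask(df > upper_q, upper_q if upper_value is None else upper_value, axis=1)
    return out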

# Standardize
>>> df = pd.DataFrame({
...     'A': [2, 4, 8, 16, 32],
...     'B': [1, 2, 3, 4, 5],
...     'C': [-15, -10, 0, 10, 15],
... })
>>> df
    A  B   C
0   2  1 -15
1   4  2 -10
2   8  3   0
3  16  4  10
4  32  5  15
>>> (df - df.mean())/df.std(ddof=0)
          A         B         C
0 -0.953206 -1.414214 -1.315587
1 -0.769897 -0.707107 -0.877058
2 -0.403280  0.000000  0.000000
3  0.329956  0.707107  0.877058
4  1.796427  1.414214  1.315587

# Just as a sniff test (ddof could be exposed as an argument to the function)
# https://drorata.github.io/posts/2017/Sep/10/why-do-we-need-to-divide-by-n-1/index.html#Summary
>>> from sklearn.preprocessing import StandardScaler
>>> standardizer = StandardScaler()
>>> standardizer.fit_transform(df)
array([[-0.95320625, -1.41421356, -1.31558703],
       [-0.76989735, -0.70710678, -0.87705802],
       [-0.40327957,  0.        ,  0.        ],
       [ 0.32995601,  0.70710678,  0.87705802],
       [ 1.79642716,  1.41421356,  1.31558703]])
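
The standardize version could expose ddof directly. A minimal sketch (name and default are illustrative only):

# Sketch only: column-wise standardization; ddof=0 matches StandardScaler above.
import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def standardize(df: pd.DataFrame, ddof: int = 0) -> pd.DataFrame:
    """Center each column to mean 0 and scale to unit standard deviation."""
    return (df - df.mean()) / df.std(ddof=ddof)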


# Scale Method
>>> df
    A  B   C
0   2  1 -15
1   4  2 -10
2   8  3   0
3  16  4  10
4  32  5  15

>>> (df - df.min()) / (df.max() - df.min())
          A     B         C
0  0.000000  0.00  0.000000
1  0.066667  0.25  0.166667
2  0.200000  0.50  0.500000
3  0.466667  0.75  0.833333
4  1.000000  1.00  1.000000

>>> (df - df.min()) / (df.max() - df.min()) * (1 - -1) + -1
          A    B         C
0 -1.000000 -1.0 -1.000000
1 -0.866667 -0.5 -0.666667
2 -0.600000  0.0  0.000000
3 -0.066667  0.5  0.666667
4  1.000000  1.0  1.000000

# Just as a sniff test
>>> from sklearn.preprocessing import MinMaxScaler
>>> scaler = MinMaxScaler()
>>> scaler.fit_transform(df)
array([[0.        , 0.        , 0.        ],
       [0.06666667, 0.25      , 0.16666667],
       [0.2       , 0.5       , 0.5       ],
       [0.46666667, 0.75      , 0.83333333],
       [1.        , 1.        , 1.        ]])
>>> scaler = MinMaxScaler(feature_range=(-1,1))
>>> scaler.fit_transform(df)
array([[-1.        , -1.        , -1.        ],
       [-0.86666667, -0.5       , -0.66666667],
       [-0.6       ,  0.        ,  0.        ],
       [-0.06666667,  0.5       ,  0.66666667],
       [ 1.        ,  1.        ,  1.        ]])
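
And the scale version could take a feature_range like sklearn's. A minimal sketch (the scale_to_range name is a placeholder, and this may overlap with pyjanitor's existing min_max_scale):

# Sketch only: rescale each column linearly onto an arbitrary feature range.
import pandas as pd
import pandas_flavor as pf

@pf.register_dataframe_method
def scale_to_range(df: pd.DataFrame, feature_range: tuple = (0, 1)) -> pd.DataFrame:
    """Map each column's min/max onto feature_range."""
    lo, hi = feature_range
    unit = (df - df.min()) / (df.max() - df.min())
    return unit * (hi - lo) + lo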

@ericmjl curious to hear your thoughts before starting to develop any/all of these ideas. Thanks!

loganthomas · Aug 19 '21 13:08

FYI, we have a min_max_scale already https://pyjanitor-devs.github.io/pyjanitor/reference/janitor.functions/janitor.min_max_scale.html#janitor-min-max-scale

loganthomas · Aug 19 '21 13:08

These are great functions to have as a convenience, @loganthomas! I'm in favour of having these in the library. The docstrings, however, should come with a strong warning: in proper machine learning practice, any such transformations should still live inside an sklearn pipeline so that information from the test set doesn't leak into training. (How that happens goes deep; we can definitely talk about it in more detail.)
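
For illustration, a minimal sketch of the pipeline pattern (dummy data via make_classification; the model choice is arbitrary): keeping the scaler inside the pipeline means its statistics come from the training fold only.

# Sketch only: scaler fit on the training split; test data transformed with those statistics.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)    # StandardScaler sees X_train only
pipe.score(X_test, y_test)    # X_test is scaled with the training mean/std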

Super glad you brought this one up, Logan!

ericmjl · Aug 19 '21 17:08