healthcareai-py Utilities :: Stratified Sampling

Open • Aylr opened this issue 7 years ago • 1 comment

I wrote this snippet during some client work, and it has been very helpful for stratified sampling of pandas dataframes. It is a thin wrapper around scikit-learn's train_test_split() with the stratify argument, and it appears to be solid.

TODO

[ ] Tests (should be fairly simple since train_test_split is covered)
[ ] Create a utilities module

Code

import matplotlib.pyplot as plt  # only used for the optional plots when verbose=True
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_sample(df, stratified_column, test_size=0.1, verbose=False):
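    """Split df into stratified train and test dataframes.

    Class proportions in stratified_column are preserved in both parts.
    test_size is passed through to train_test_split().
    Returns (train_df, test_df); both are rebuilt with a fresh default index.
    """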
    def _glue(y_column, x_column_names, x, y):
        temp_df = pd.DataFrame(x)
        temp_df.columns = x_column_names
        temp_df[y_column] = y
        return temp_df

    x_df = df.drop(stratified_column, axis='columns')
    y_df = df[stratified_column]
    # as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
    x = x_df.to_numpy()
    y = y_df.to_numpy()
    x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size=test_size)
    x_columns = x_df.columns

    train_df = _glue(stratified_column, x_columns, x_train, y_train)
    test_df = _glue(stratified_column, x_columns, x_test, y_test)

    if verbose:
        print('Original:\n', df[stratified_column].value_counts(), '\n')
        print('Sampled down to ({}) records:\n'.format(len(test_df)), test_df[stratified_column].value_counts(), '\n')
        
        # Plot class counts for the stratified column, before and after sampling
        df[stratified_column].value_counts().plot.barh(title='Original Dataset')
        plt.show()
        test_df[stratified_column].value_counts().plot.barh(title='Sampled Dataset')
        plt.show()

    return train_df, test_df
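
A possible simplification: train_test_split() also accepts dataframes directly, so the round trip through arrays and _glue() may not be needed. Below is a minimal sketch, assuming it is acceptable for the returned dataframes to keep the original index and column dtypes. The name stratified_split_df, the random_state parameter, and the example column names are hypothetical and not part of the snippet above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split_df(df, stratified_column, test_size=0.1, random_state=None):
    """Hypothetical variant: split a dataframe directly, stratifying on one column."""
    # train_test_split returns dataframes when given a dataframe,
    # so column names, dtypes and the original index are preserved.
    train_df, test_df = train_test_split(
        df,
        test_size=test_size,
        stratify=df[stratified_column],
        random_state=random_state,
    )
    return train_df, test_df


# Illustrative data: 32 rows of class 'a' and 8 rows of class 'b'
example = pd.DataFrame({
    'id': list(range(40)),
    'label': ['a', 'a', 'a', 'a', 'b'] * 8,
})
train, holdout = stratified_split_df(example, 'label', test_size=0.1, random_state=0)
print(len(train), len(holdout))
print(holdout['label'].value_counts())
```

Because the index is kept, rows in the holdout can be traced back to the original dataframe; call reset_index(drop=True) on the results if a fresh index is preferred.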

Usage

df = pd.DataFrame({
    'id': list(range(40)),
    'other': [-x for x in range(40)],
    'foo': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1] * 4,
})
train, holdout = stratified_sample(df, stratified_column='foo', test_size=0.1)
print(len(holdout))
holdout.foo.hist()
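
For the Tests item in the TODO, the histogram check above could probably be turned into a numeric one. A rough sketch of that idea follows, comparing class proportions in the original and the holdout. To keep the block self-contained it calls train_test_split() directly as a stand-in for stratified_sample(); the random_state value and the comparison dataframe are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    'id': list(range(40)),
    'foo': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1] * 4,
})

# Stand-in for: train, holdout = stratified_sample(df, stratified_column='foo', test_size=0.1)
train, holdout = train_test_split(
    df, test_size=0.1, stratify=df['foo'], random_state=0
)

# Class proportions side by side, aligned on the class label
comparison = pd.DataFrame({
    'original': df['foo'].value_counts(normalize=True),
    'holdout': holdout['foo'].value_counts(normalize=True),
})
comparison['abs_diff'] = (comparison['original'] - comparison['holdout']).abs()
print(comparison)

# With stratification, each class should be off by less than one holdout row
max_allowed = 1 / len(holdout)
print('within tolerance:', comparison['abs_diff'].max() < max_allowed)
```

The tolerance of one holdout row is a judgment call; a real test might tighten it for larger samples.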

Jan 29 '18 23:01 Aylr

Example verbose output:

Jan 30 '18 20:01 Aylr
