healthcareai-py
Utilities :: Stratified Sampling
I wrote this snippet during some client work, and it has been very helpful whenever I need a stratified sample of a dataframe. It uses scikit-learn's train_test_split() with the stratify argument and appears to be solid.
TODO
- [ ] Tests (should be fairly simple, since train_test_split is already covered)
- [ ] Create a utilities module
Code
```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_sample(df, stratified_column, test_size=0.1, verbose=False):
    """Split df into train and test dataframes, stratified on one column."""

    def _glue(y_column, x_column_names, x, y):
        # Rebuild a dataframe from the feature matrix and the target values.
        temp_df = pd.DataFrame(x)
        temp_df.columns = x_column_names
        temp_df[y_column] = y
        return temp_df

    x_df = df.drop(stratified_column, axis='columns')
    y_df = df[stratified_column]

    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    x = x_df.to_numpy()
    y = y_df.to_numpy()

    x_train, x_test, y_train, y_test = train_test_split(
        x, y, stratify=y, test_size=test_size)

    x_columns = x_df.columns
    train_df = _glue(stratified_column, x_columns, x_train, y_train)
    test_df = _glue(stratified_column, x_columns, x_test, y_test)

    if verbose:
        print('Original:\n', df[stratified_column].value_counts(), '\n')
        print('Sampled down to ({}) records:\n'.format(len(test_df)),
              test_df[stratified_column].value_counts(), '\n')

        # Plot class counts for the stratified column (previously hard-coded
        # to a 'final_state' column) before and after sampling.
        df[stratified_column].value_counts().plot.barh(title='Original Dataset')
        plt.show()
        test_df[stratified_column].value_counts().plot.barh(title='Sampled Dataset')
        plt.show()

    return train_df, test_df
```
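One thing to be aware of: converting to arrays and gluing the columns back together resets the index and can change column dtypes when the columns are of mixed types. As a possible alternative (a sketch only, not part of the snippet above), train_test_split() also accepts dataframes directly, so the split can be done without the round trip. The function name `stratified_split_frames` and the `random_state` parameter are my own additions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_split_frames(df, stratified_column, test_size=0.1, random_state=None):
    """Hypothetical variant: split the dataframe itself, stratified on one column."""
    train_df, test_df = train_test_split(
        df,
        stratify=df[stratified_column],  # keep class proportions in both parts
        test_size=test_size,
        random_state=random_state,       # fix this to make the split repeatable
    )
    return train_df, test_df


# Small demo frame: 32 rows of class 1 and 8 rows of class 2.
demo = pd.DataFrame({
    'id': range(40),
    'foo': [1, 1, 1, 1, 2] * 8,
})

train_part, test_part = stratified_split_frames(
    demo, 'foo', test_size=0.25, random_state=0)

print(len(train_part), len(test_part))
print(test_part['foo'].value_counts().to_dict())
```

Because the rows are selected rather than rebuilt, the original index and dtypes carry over to both parts. Whether that is preferable depends on whether downstream code expects a fresh 0..n index.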
Usage
```python
df = pd.DataFrame({
    'id': list(range(40)),
    'other': [-x for x in range(40)],
    'foo': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1,
            1, 1, 1, 1, 2, 1, 1, 1, 2, 1,
            1, 1, 1, 1, 2, 1, 1, 1, 2, 1,
            1, 1, 1, 1, 2, 1, 1, 1, 2, 1]
})

train, holdout = stratified_sample(df, stratified_column='foo', test_size=0.1)
print(len(holdout))
holdout.foo.hist()
```
Example verbose output:
[Image: screen shot 2018-01-30 at 1:34:37 pm]
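For the tests in the TODO, one simple check would be that the class shares in the holdout stay close to those in the original data. The sketch below is only a starting point under a few assumptions: `max_share_gap` is a hypothetical helper name, the 0.1 tolerance is arbitrary, and a direct train_test_split() call on the dataframe stands in for stratified_sample() so the block runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def max_share_gap(original, sample, column):
    """Hypothetical helper: largest absolute difference in class share."""
    original_shares = original[column].value_counts(normalize=True)
    sample_shares = sample[column].value_counts(normalize=True)
    # Classes missing from the sample count as a share of zero.
    sample_shares = sample_shares.reindex(original_shares.index, fill_value=0.0)
    return float((original_shares - sample_shares).abs().max())


# Same class layout as the usage example: 32 rows of class 1, 8 of class 2.
check_df = pd.DataFrame({
    'id': list(range(40)),
    'foo': [1, 1, 1, 1, 2, 1, 1, 1, 2, 1] * 4,
})

# Stand-in for stratified_sample(): a stratified split of the dataframe itself.
_, check_holdout = train_test_split(
    check_df, stratify=check_df['foo'], test_size=0.1, random_state=0)

gap = max_share_gap(check_df, check_holdout, 'foo')
print(len(check_holdout), round(gap, 2))

# Arbitrary tolerance; with very small samples the shares cannot match exactly.
assert gap <= 0.1
```

With only a handful of holdout rows the shares can only approximate the original ones, so a real test would probably want a larger fixture or a tolerance tied to the sample size.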