boltzmannclean
Clean on Training and use the information from that to clean on Test
I have a dataset in which both the training and test sets have missing values. In sklearn, we fit on the training data with fit_transform and later apply transform to the test data. I want to do the same here: clean the training set, then use what was learned from it to clean the test set. How can I do this?
Hi @VikasNS, you will need to slightly modify the clean function for this use case, with the changes below:
- def clean(dataframe, numerical_columns, categorical_columns, tune_rbm):
+ def clean(dataframe, numerical_columns, categorical_columns, tune_rbm, rbm=None):
and
-     rbm = train_rbm(preprocessed_array, tune_hyperparameters=tune_rbm)
+     if rbm is None:
+         rbm = train_rbm(preprocessed_array, tune_hyperparameters=tune_rbm)
With those changes in place, you will be able to do the following:
import boltzmannclean
import numpy as np

# training_dataframe and test_dataframe are your own DataFrames
numerical_columns = ['list', 'of', 'numerical', 'column', 'names']
categorical_columns = ['list', 'of', 'categorical', 'column', 'names']

# Preprocess the training data into the array that train_rbm expects
numerics, scaler = boltzmannclean.preprocess_numerics(
    training_dataframe, numerical_columns
)
categoricals, category_dict = boltzmannclean.preprocess_categoricals(
    training_dataframe, categorical_columns
)
preprocessed_array = np.hstack((numerics, categoricals))

# Train the RBM once, on the training data only
pretrained_rbm = boltzmannclean.train_rbm(
    preprocessed_array, tune_hyperparameters=True  # or False, up to you
)

# Pass the pretrained RBM to the modified clean for both datasets
cleaned_training_dataframe = boltzmannclean.clean(
    training_dataframe, numerical_columns, categorical_columns,
    tune_rbm=False, rbm=pretrained_rbm
)
cleaned_test_dataframe = boltzmannclean.clean(
    test_dataframe, numerical_columns, categorical_columns,
    tune_rbm=False, rbm=pretrained_rbm
)
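For anyone unsure why the extra rbm=None argument is enough, here is a minimal, library-free sketch of the same pattern: train only when no model is passed in, otherwise reuse the one given. Everything in it is a hypothetical stand-in and not part of boltzmannclean: train_model plays the role of train_rbm and simply learns column means, and this toy clean fills missing values (None) from those means rather than using an RBM.

```python
from typing import List, Optional

Row = List[Optional[float]]


def train_model(rows: List[Row]) -> List[float]:
    """Hypothetical stand-in for train_rbm: learn the mean of each column."""
    means = []
    for col in range(len(rows[0])):
        observed = [row[col] for row in rows if row[col] is not None]
        means.append(sum(observed) / len(observed))
    return means


def clean(rows: List[Row], model: Optional[List[float]] = None) -> List[List[float]]:
    """Fill missing values; train a model only if none was supplied."""
    if model is None:
        model = train_model(rows)
    return [
        [model[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


train = [[1.0, 10.0], [3.0, None], [None, 30.0]]
test = [[None, 5.0], [2.0, None]]

# Train once on the training data, then reuse the model for both sets.
model = train_model(train)
cleaned_train = clean(train, model=model)
cleaned_test = clean(test, model=model)
```

Because the model is trained once and passed in explicitly, the test set is filled using only what was learned from the training set, which mirrors the fit_transform / transform split asked about in the question.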