sherlock-project
Question about imputing NaN values
I am referring to the code in 01-data-preprocessing.ipynb, specifically the section "Impute NaN values with feature means".
Currently, NaN values in the extracted features are imputed with the column means of the training sample. This means each average is computed over all vectors, regardless of their class.
train_columns_means = pd.DataFrame(X_train.mean()).transpose()
X_train.fillna(train_columns_means.iloc[0], inplace=True)
X_validation.fillna(train_columns_means.iloc[0], inplace=True)
X_test.fillna(train_columns_means.iloc[0], inplace=True)
Wouldn't it be better to calculate the averages per class and replace each NaN value with the mean of its own class? We could append the types from train_labels.parquet to the data, group by type, and compute the averages per class, saving the results in train_columns_means.
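A minimal sketch of what I have in mind, on toy data rather than the real Sherlock features. The function name impute_with_class_means, the column names f1 and f2, and the toy labels are made up for illustration; I am also assuming the labels are a pandas Series aligned with the index of X_train. The fallback to the overall train mean (for a class that has no observed value in a column) is my own assumption, not something the notebook does.

```python
import numpy as np
import pandas as pd


def impute_with_class_means(X_train, y_train):
    """Fill NaNs in X_train with the mean of the same column within the same class.

    Hypothetical helper: y_train must be a Series sharing X_train's index.
    """
    # For every row, the per-column mean of that row's class (same shape as X_train).
    class_means = X_train.groupby(y_train).transform("mean")
    # Assumed fallback: overall train mean when a class has no observed values.
    global_means = X_train.mean()
    return X_train.fillna(class_means).fillna(global_means)


# Toy stand-ins for the extracted features and for the types in train_labels.parquet.
X_train = pd.DataFrame(
    {
        "f1": [1.0, np.nan, 3.0, 10.0, np.nan, 5.0],
        "f2": [np.nan, 2.0, 6.0, np.nan, np.nan, 10.0],
    }
)
y_train = pd.Series(["address", "address", "address", "age", "age", "name"], name="type")

filled = impute_with_class_means(X_train, y_train)
print(filled)
```

Only the training rows are filled here, using training data only. How to fill X_validation and X_test, where the type is the thing being predicted, is not covered by this sketch.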
Am I missing some theoretical concept or would this actually be an improvement to the system?
Hi Giacomo,
I think that is indeed a valid improvement for imputing missing values. By imputing the average across all classes, we intended to avoid learning patterns across missing data, but at the same time the model might miss some signal because of this. I can imagine that going with class-specific averages improves performance. One thing to be careful about is not to include any data from the test dataset when calculating these averages, to avoid leakage.
Good luck!
Madelon