
Doubt about imputing NaN values

Open: GiacomoPracucci opened this issue 11 months ago • 1 comment

I am referring to the code in 01-data-preprocessing.ipynb, specifically the section "Impute NaN values with feature means".

Currently, NaN values in the extracted features are imputed with the mean of the corresponding column in the train sample. That mean is computed over all feature vectors, regardless of class.

train_columns_means = pd.DataFrame(X_train.mean()).transpose()
X_train.fillna(train_columns_means.iloc[0], inplace=True)
X_validation.fillna(train_columns_means.iloc[0], inplace=True)
X_test.fillna(train_columns_means.iloc[0], inplace=True)

Wouldn't it be better to compute the means per class and replace each NaN value with the mean of the row's own class? We could append the types from train_labels.parquet to the data, group by type, and compute the per-class means, saving the results in train_columns_means.
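A minimal sketch of the per-class idea on a toy frame, not the notebook's actual code. The function name `impute_with_class_means`, the feature columns `f1`/`f2`, and the two example types are hypothetical; only the train split is touched here, and the fallback to the overall column mean is an added assumption for classes that have no observed value in a column.

```python
import numpy as np
import pandas as pd


def impute_with_class_means(X, y):
    """Fill NaNs in X with the column mean of each row's own class.

    Hypothetical helper: X is a feature DataFrame, y holds one label per row.
    """
    # Align the labels with X's index so groupby pairs rows and labels correctly.
    labels = pd.Series(np.asarray(y), index=X.index, name="type")
    # transform("mean") returns a frame shaped like X holding per-class means.
    class_means = X.groupby(labels).transform("mean")
    # Assumption: if a class has no observed value in a column,
    # fall back to the overall column mean.
    return X.fillna(class_means).fillna(X.mean())


# Toy stand-in for X_train and the types from train_labels.parquet.
X_train = pd.DataFrame(
    {
        "f1": [1.0, np.nan, 3.0, 5.0, np.nan, 9.0],
        "f2": [np.nan, 2.0, 4.0, np.nan, np.nan, 6.0],
    }
)
y_train = ["address", "address", "address", "name", "name", "name"]

X_train_imputed = impute_with_class_means(X_train, y_train)
print(X_train_imputed)
```

Note that this only works where labels are known. Rows without a label (for example at prediction time) would still need a label-free fallback such as the overall train means.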

Am I missing some theoretical concept or would this actually be an improvement to the system?

GiacomoPracucci, Mar 14 '24 11:03

Hi Giacomo,

I think that is indeed a valid improvement for imputing missing values. By imputing the mean across all classes, we intended to keep the model from learning patterns from the missing data itself, but the model might miss some signal because of this. I can imagine that class-specific means improve performance. One thing to be careful about: do not include any data from the test dataset when calculating these means, to avoid leakage.
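One way to keep the test data out of the statistics is to separate "fitting" the means from "applying" them, sketched below under stated assumptions. The helper names `fit_imputation_stats` and `impute_unlabelled` and the toy data are hypothetical. The sketch also assumes that test rows are treated as unlabelled, so they are filled with the overall train means rather than per-class means.

```python
import numpy as np
import pandas as pd


def fit_imputation_stats(X_train, y_train):
    """Compute imputation statistics from the train split only (hypothetical helper)."""
    labels = pd.Series(np.asarray(y_train), index=X_train.index, name="type")
    return {
        "per_class": X_train.groupby(labels).mean(),  # one row of means per type
        "overall": X_train.mean(),                    # label-free fallback
    }


def impute_unlabelled(X, stats):
    """Fill NaNs using train-derived overall means; never recomputes from X."""
    return X.fillna(stats["overall"])


# Toy data standing in for the real splits.
X_train = pd.DataFrame({"f1": [1.0, np.nan, 5.0, 9.0]})
y_train = ["address", "address", "name", "name"]
X_test = pd.DataFrame({"f1": [np.nan, 100.0]})

stats = fit_imputation_stats(X_train, y_train)
X_test_imputed = impute_unlabelled(X_test, stats)
print(stats["per_class"])
print(X_test_imputed)
```

The value 100.0 in the toy test split has no influence on the fill value, because every statistic was computed before the test split was touched.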

Good luck!

Madelon

madelonhulsebos, Mar 14 '24 21:03