
multivariate imputation

Open solegalli opened this issue 2 years ago • 4 comments

In multivariate imputation, we estimate the values of missing data using regression or classification models based on the other variables in the data.

The IterativeImputer allows us to use only one kind of model, either regression or classification. But we often have binary, discrete and continuous variables in our datasets, so we would like to use a suitable model for each variable to carry out the imputation.

Can we design a transformer that does exactly so?

It would either recognise binary, multiclass and continuous variables or ask the user to enter them, and then train suitable models to predict the values of the missing data for each variable type.
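To make the idea concrete, here is a rough sketch of how the variable types could be recognised and matched to a model. The helper names `infer_variable_type` and `pick_estimator` and the `max_classes` threshold are hypothetical, not part of feature-engine or scikit-learn, and random forests are only one possible choice of estimator.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def infer_variable_type(series: pd.Series, max_classes: int = 10) -> str:
    """Guess whether a column is binary, multiclass or continuous.

    Hypothetical heuristic: `max_classes` is an assumed cut-off, and a
    user-supplied list of variables could replace this guess entirely.
    """
    n_unique = series.dropna().nunique()
    if n_unique <= 2:
        return "binary"
    # Non-numeric columns, or numeric columns with few distinct values,
    # are treated as multiclass.
    if not pd.api.types.is_numeric_dtype(series) or n_unique <= max_classes:
        return "multiclass"
    return "continuous"


def pick_estimator(var_type: str):
    """Return a regressor for continuous targets, a classifier otherwise."""
    if var_type == "continuous":
        return RandomForestRegressor(n_estimators=50, random_state=0)
    return RandomForestClassifier(n_estimators=50, random_state=0)
```

A heuristic like this will misclassify some columns (for example, a count variable with few distinct values), which is one reason to also let the user pass the variable lists explicitly.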

solegalli avatar Mar 30 '22 15:03 solegalli

Looks fun, @solegalli! I'm happy to tackle this issue. Which issue do you prefer we address first? This issue or #107?

Morgan-Sell avatar Apr 09 '22 18:04 Morgan-Sell

Hi @solegalli,

I see sklearn has an experimental version of the IterativeImputer. Do we still want to implement this transformer into feature-engine?

When training the transformer's estimator, will the transformer organize the non-missing values for the dependent variable as the training set and all the np.nan values as the "test set" or values to be predicted?

Also, given there are most likely np.nan scattered throughout the dataset, I'm assuming we should limit the estimators to models that handle np.nan, e.g., random forest.

Morgan-Sell avatar Jul 10 '22 04:07 Morgan-Sell

Hi @Morgan-Sell

The IterativeImputer will return a continuous value to impute NA. But some variables are categorical, so for those, classification would be more suitable than regression.

NaNs are handled during the subsequent rounds of imputation, like the IterativeImputer does.

So I guess the only difference would be that our imputer is able to distinguish when to do regression and when to do classification. Or maybe it could even give the user the option to pass a list of categorical and numerical variables.
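Roughly, the rounds could look like the sketch below. This is only an illustration under assumptions, not a design for the final class: the function name `chained_impute` and its parameters are hypothetical, every column is assumed to be numerically encoded (categories as integer codes), and random forests stand in for whatever estimators we end up choosing.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def chained_impute(df, categorical, numerical, n_rounds=3):
    """Round-robin imputation with a classifier or regressor per column.

    Assumes all columns are numerically encoded and that each column has at
    least some observed values to train on.
    """
    missing = df.isna()
    filled = df.copy()

    # Starting values: mode for categorical, mean for numerical, so that
    # every model sees complete features from the first round onwards.
    for col in categorical:
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    for col in numerical:
        filled[col] = filled[col].fillna(filled[col].mean())

    for _ in range(n_rounds):
        for col in categorical + numerical:
            mask = missing[col]
            if not mask.any():
                continue
            model = (
                RandomForestClassifier(n_estimators=50, random_state=0)
                if col in categorical
                else RandomForestRegressor(n_estimators=50, random_state=0)
            )
            X = filled.drop(columns=[col])
            # Train on originally observed rows, then overwrite only the
            # originally missing entries with fresh predictions.
            model.fit(X[~mask], filled.loc[~mask, col])
            filled.loc[mask, col] = model.predict(X[mask])
    return filled
```

Because the classifier predicts one of the observed classes, imputed categorical values stay valid categories instead of becoming fractional numbers.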

Also, I read the papers a while ago, but before drafting this class, it would be good to read the papers on MICE (multivariate imputation by chained equations) and MissForest.

solegalli avatar Aug 03 '22 07:08 solegalli

Hi @solegalli,

Yeah, I read a paper on MICE. I saw there that R has a MICE package.

I'm going to table this one for the moment to focus on the other transformers. Maybe one of our wonderful collaborators will pick this one up ;)

Morgan-Sell avatar Aug 20 '22 16:08 Morgan-Sell