About the EM problem

Open ROOKLO opened this issue 3 years ago • 3 comments

I randomly set some entries of my data set to NaN, and I want to compare the performance of the EM methods in SPSS and impyute. These are the results I got:

spss_em
MSE_spss: 22.177916455492653
r_spss: 0.721709731654166

impyute_em
MSE_impyute: 289.1830722478248
r_impyute: 0.002467765572835078

The EM from impyute does not seem to work very well, and I do not know why.
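For readers who want to reproduce this kind of comparison, here is a minimal sketch of how the two scores could be computed on the masked entries only. The function name `score_imputation` and the toy numbers are hypothetical; they are not taken from SPSS, impyute, or the data in this issue.

```python
import math

def score_imputation(true_vals, imputed_vals):
    """Return (MSE, Pearson r) between held-out true values and their imputations."""
    n = len(true_vals)
    if n != len(imputed_vals) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / n
    mean_t = sum(true_vals) / n
    mean_p = sum(imputed_vals) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(true_vals, imputed_vals))
    var_t = sum((t - mean_t) ** 2 for t in true_vals)
    var_p = sum((p - mean_p) ** 2 for p in imputed_vals)
    r = cov / math.sqrt(var_t * var_p)  # undefined if either side is constant
    return mse, r

# Toy values only (hypothetical, not the data from this issue).
truth = [1.0, 2.0, 3.0, 4.0]
guess = [1.5, 2.5, 2.5, 4.5]
mse, r = score_imputation(truth, guess)
print(f"MSE={mse:.3f}  r={r:.3f}")
```

Scoring only the positions that were deliberately masked keeps the observed entries, which every method reproduces exactly, from inflating the correlation.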

ROOKLO • Sep 20 '20 13:09

I am not familiar with the details of the SPSS EM implementation, but I read the source code of the em function in impyute. The implementation is very simple: for each missing value, it repeatedly draws from the Gaussian distribution defined by the mean and variance of the current column, and stops once the difference from the previous fill value is very small. This method may not be effective on data with more complex characteristics.
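To make the consequence concrete, here is a rough sketch of the procedure as described in this comment. It is not impyute's actual code: `resample_fill`, `pearson`, the `eps` and `max_iter` values, and the toy data are all assumptions made for illustration, and the column statistics are computed once from the observed entries as a simplification.

```python
import math
import random

def resample_fill(column, eps=0.1, max_iter=20, seed=0):
    """Fill None entries with draws from N(mean, std) of the observed entries."""
    rng = random.Random(seed)
    observed = [v for v in column if v is not None]
    mu = sum(observed) / len(observed)
    sd = math.sqrt(sum((v - mu) ** 2 for v in observed) / len(observed))
    filled = []
    for v in column:
        if v is not None:
            filled.append(v)
            continue
        previous = rng.gauss(mu, sd)
        for _ in range(max_iter):
            current = rng.gauss(mu, sd)
            # Stop once two successive draws differ by less than eps (relative).
            done = abs(current - previous) <= eps * abs(previous)
            previous = current
            if done:
                break
        filled.append(previous)
    return filled

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

# Toy data (hypothetical): y depends strongly on x; 500 of 2000 y values are hidden.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [2 * xi + rng.gauss(0, 0.5) for xi in x]
hidden = sorted(rng.sample(range(2000), 500))
hidden_set = set(hidden)
y_missing = [None if i in hidden_set else yi for i, yi in enumerate(y)]
y_filled = resample_fill(y_missing)

true_vals = [y[i] for i in hidden]
fill_vals = [y_filled[i] for i in hidden]
mse = sum((t - f) ** 2 for t, f in zip(true_vals, fill_vals)) / len(hidden)
r = pearson(true_vals, fill_vals)
print(f"MSE={mse:.3f}  r={r:.3f}")
```

Because each draw ignores the other columns, the fill values are statistically independent of the true values. One would therefore expect a correlation near zero and a mean squared error of roughly twice the column variance, which is the same pattern as the r_impyute value reported in this issue.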

BaoxueLi • Jan 22 '21 02:01

Maybe the data is not normally distributed, or not missing at random. In that case the normal distribution defined by the mean and standard deviation of the observed data in each column (feature) cannot represent the data's true distribution, and bias is introduced in the first iteration.

ROOKLO • Jan 22 '21 08:01

I also do not think the implementation here at impyute is correct, as it does not use any covariance structure and only uses the mean and standard deviation of the current column. Murphy's "Machine Learning: A Probabilistic Perspective", chapter 11.6, shows how to use the EM algorithm to derive the sufficient statistics in the normal case. Does the algorithm actually converge for any delta?
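As an illustration of what using the covariance structure buys, here is a minimal sketch of EM for the simplest case: a bivariate normal where x is fully observed and only y has missing entries. It is not the SPSS implementation and not a transcription of Murphy's derivation; `em_bivariate_fill`, `tol`, `max_iter`, and the toy data are assumptions. A general version would handle arbitrary missingness patterns with matrix operations.

```python
import math
import random

def em_bivariate_fill(x, y, tol=1e-8, max_iter=500):
    """EM for a bivariate normal; x fully observed, y has None where missing."""
    n = len(x)
    obs = [i for i in range(n) if y[i] is not None]
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n
    # Initialise the y parameters from the complete cases.
    mu_y = sum(y[i] for i in obs) / len(obs)
    s_yy = sum((y[i] - mu_y) ** 2 for i in obs) / len(obs)
    s_xy = sum((x[i] - mu_x) * (y[i] - mu_y) for i in obs) / len(obs)
    for _ in range(max_iter):
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy  # Var(y | x) under current parameters
        # E-step: expected y and y^2 for each row given x.
        ey = [mu_y + beta * (x[i] - mu_x) if y[i] is None else y[i] for i in range(n)]
        ey2 = [ey[i] ** 2 + (resid_var if y[i] is None else 0.0) for i in range(n)]
        # M-step: re-estimate parameters from the expected sufficient statistics.
        new_mu_y = sum(ey) / n
        new_s_yy = sum(ey2) / n - new_mu_y ** 2
        new_s_xy = sum(a * b for a, b in zip(x, ey)) / n - mu_x * new_mu_y
        change = max(abs(new_mu_y - mu_y), abs(new_s_yy - s_yy), abs(new_s_xy - s_xy))
        mu_y, s_yy, s_xy = new_mu_y, new_s_yy, new_s_xy
        if change < tol:  # stop when the parameters stop moving
            break
    beta = s_xy / s_xx
    # Impute with the conditional mean E[y | x] under the fitted parameters.
    return [mu_y + beta * (x[i] - mu_x) if y[i] is None else y[i] for i in range(n)]

# Toy data (hypothetical): y depends strongly on x; 500 of 2000 y values are hidden.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [2 * xi + rng.gauss(0, 0.5) for xi in x]
hidden = sorted(rng.sample(range(2000), 500))
hidden_set = set(hidden)
y_missing = [None if i in hidden_set else yi for i, yi in enumerate(y)]
y_filled = em_bivariate_fill(x, y_missing)

t = [y[i] for i in hidden]
f = [y_filled[i] for i in hidden]
mse = sum((a - b) ** 2 for a, b in zip(t, f)) / len(t)
mt, mf = sum(t) / len(t), sum(f) / len(f)
r = sum((a - mt) * (b - mf) for a, b in zip(t, f)) / math.sqrt(
    sum((a - mt) ** 2 for a in t) * sum((b - mf) ** 2 for b in f))
print(f"MSE={mse:.3f}  r={r:.3f}")
```

Here the stopping rule is applied to the change in the estimated parameters rather than to successive random draws, so the iteration is deterministic and settles down. The imputed values are conditional means given x, which is why they track the true values when the columns are correlated.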

mkrtl • Feb 25 '21 09:02