Not clear how to prepare a dataset to apply Little's MCAR test to it

Open balovp opened this issue 2 years ago • 5 comments

Hello! My name is Pavel, and I am a student in Data Science. I am struggling to apply Little's MCAR test from your pyampute package to my dataset. I am trying to run the example code from the page https://rianneschouten.github.io/pyampute/build/html/pyampute.exploration.html#pyampute.exploration.mcar_statistical_tests.MCARTest.little_mcar_test

    from pyampute.exploration.mcar_statistical_tests import MCARTest
    games.info()
    temp_for_mcar_test = games
    mt = MCARTest(method="little")
    print(mt.little_mcar_test(temp_for_mcar_test))

But I receive a KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['name', 'platform', 'genre', 'rating'], dtype='object'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

My dataset info is here: image

Could you please provide step-by-step instructions on how to prepare a dataset so that the MCAR test from pyampute can be applied to it?

balovp avatar Oct 08 '22 14:10 balovp

Hi @balovp, thanks for bringing this to our attention. I will take a look at what's going on and prepare a dataset for you if needed! Please give me until the end of the week and I'll have it ready.

davzaman avatar Oct 12 '22 18:10 davzaman

Dear @balovp,

Thank you for reaching out to us!

Yes, our current implementation of Little's MCAR test only works for numerical data. We create a covariance matrix using pd.DataFrame.cov() and non-numerical features are not included in this matrix. Since your dataset has an incomplete object-type feature, this will give problems when asking for the covariance.

I advise you to perform Little's MCAR test only on the numerical features. As a side note: be aware that Little’s MCAR test is not a guarantee that your data is fully MCAR. It is merely an indication and should be used with caution. It would be good to do further analysis using missing data patterns, histograms, and other visualization methods to find out why there is missing data in your dataset and what you can do about it.
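A minimal sketch of that preparation step, assuming a pandas DataFrame. The toy `games` frame below is a hypothetical stand-in with invented values; only the column names are taken from this thread. It keeps the numerical columns and then counts the missing data patterns as a first exploratory check.

```python
import pandas as pd

# Toy stand-in for the `games` DataFrame; the values are invented for illustration.
games = pd.DataFrame({
    "name": ["A", "B", None, "D"],
    "platform": ["Wii", "PS2", "PS2", "X360"],
    "year_of_release": [2006.0, None, 2008.0, 2010.0],
    "critic_score": [76.0, 82.0, None, None],
})

# Keep only numerical columns; object-type columns such as `name` are dropped.
numeric_games = games.select_dtypes(include="number")
print(list(numeric_games.columns))

# Count how often each missingness pattern occurs (True = value is missing).
patterns = numeric_games.isna().value_counts()
print(patterns)
```

The resulting `numeric_games` frame is what would then be passed to `MCARTest(method="little").little_mcar_test(...)`, as in the code from the original post.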

@davzaman we may want to change the numeric_only argument of pd.DataFrame.cov() from numeric_only = True to numeric_only = False. I am not sure whether this will cause problems later on when calculating the test statistics.

RianneSchouten avatar Oct 13 '22 09:10 RianneSchouten

Dear @RianneSchouten , @davzaman ,

Thank you very much for such a detailed answer to my newbie question. There is more: I created a subset of my data with only the numeric columns and then applied the MCAR test. It behaved strangely: it returned a p-value that is exactly zero (I checked this with mt.little_mcar_test(data_mcar) == 0, which returned True).

Here's my code:

    from pyampute.exploration.mcar_statistical_tests import MCARTest
    data_mcar = games[['year_of_release', 'na_sales', 'eu_sales', 'jp_sales', 'other_sales', 'critic_score']]
    mt = MCARTest(method="little")
    print("Little's MCAR p-value:", f'{mt.little_mcar_test(data_mcar):.30f}')

As far as I know, a p-value cannot be exactly zero; it can only be close to zero.

Could you please help me, what am I doing wrong?

balovp avatar Oct 15 '22 14:10 balovp

@balovp

This probably has to do with rounding; Python stores floating-point numbers with a limited number of digits. The chi-square test statistic for your dataset is very large, so the chi-square cumulative distribution function approaches 1. Python rounds that value to exactly 1, and the resulting p-value is therefore rounded to 0.
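The rounding effect can be reproduced with the standard library alone. The sketch below assumes the p-value is computed as 1 minus the CDF (whether pyampute does exactly this is an assumption) and uses a chi-square distribution with 2 degrees of freedom, chosen only because its CDF has the closed form 1 - exp(-x/2). The function names are hypothetical.

```python
import math

def chi2_p_naive(x):
    """p-value as 1 - CDF for a chi-square distribution with 2 degrees of freedom."""
    cdf = 1.0 - math.exp(-x / 2.0)  # rounds to exactly 1.0 once exp(-x/2) is tiny
    return 1.0 - cdf

def chi2_p_direct(x):
    """Same p-value, computed directly as the upper tail exp(-x/2)."""
    return math.exp(-x / 2.0)

x = 100.0  # an illustrative, large test statistic
p_naive = chi2_p_naive(x)
p_direct = chi2_p_direct(x)

print(p_naive)   # 0.0, because the CDF was rounded to 1.0
print(p_direct)  # tiny but still positive
```

The true p-value is positive in both cases; only the subtraction from a value rounded to 1 makes it print as zero. For general degrees of freedom, a survival function such as `scipy.stats.chi2.sf` evaluates the upper tail directly and avoids that subtraction, although it too underflows to 0 for extremely large statistics.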

Does this answer your question?

RianneSchouten avatar Oct 17 '22 11:10 RianneSchouten

Hi @balovp following up here, do you have any questions remaining?

davzaman avatar Dec 16 '22 22:12 davzaman