pyampute
Not clear how to prepare a dataset to apply Little's MCAR test to it
Hello! My name is Pavel, and I am a student in Data Science. I am struggling to use Little's MCAR test from your pyampute package on my dataset. I am trying to run the example code from this page: https://rianneschouten.github.io/pyampute/build/html/pyampute.exploration.html#pyampute.exploration.mcar_statistical_tests.MCARTest.little_mcar_test
from pyampute.exploration.mcar_statistical_tests import MCARTest
games.info()
temp_for_mcar_test = games
mt = MCARTest(method="little")
print(mt.little_mcar_test(temp_for_mcar_test))
But I receive a KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['name', 'platform', 'genre', 'rating'], dtype='object'). See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
My dataset info is here:
Could you please provide step-by-step instructions on how to prepare a dataset so that the MCAR test from pyampute can be applied to it?
Hi @balovp Thanks for bringing this to our attention. I will take a look at what's going on and prepare a dataset for you if needed! Please give me until the end of the week and I'll have it ready.
Dear @balovp,
Thank you for reaching out to us!
Yes, our current implementation of Little's MCAR test only works for numerical data. We create a covariance matrix using pd.DataFrame.cov() and non-numerical features are not included in this matrix. Since your dataset has an incomplete object-type feature, this will give problems when asking for the covariance.
I advise you to perform Little's MCAR test only on the numerical features. As a side note: be aware that Little’s MCAR test is not a guarantee that your data is fully MCAR. It is merely an indication and should be used with caution. It would be good to do further analysis using missing data patterns, histograms, and other visualization methods to find out why there is missing data in your dataset and what you can do about it.
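Since the test builds on a covariance matrix of numeric columns, one way to prepare the data is to drop the object-type columns before running it. The snippet below is a minimal sketch: the small `games` frame is a hypothetical stand-in for the real dataset, and `numeric_games` is an illustrative name. The pyampute call is left as a comment and mirrors the usage shown in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `games` DataFrame from the question.
games = pd.DataFrame({
    "name": ["A", None, "C", "D"],
    "genre": ["Action", "Sports", None, "Racing"],
    "year_of_release": [2006.0, np.nan, 2008.0, 2010.0],
    "na_sales": [41.36, 29.08, 15.68, 15.61],
    "critic_score": [76.0, np.nan, 82.0, np.nan],
})

# Keep only numeric columns; object-type columns such as `name` are dropped.
numeric_games = games.select_dtypes(include="number")
print(list(numeric_games.columns))

# With pyampute installed, the test would then be run on the numeric subset:
# from pyampute.exploration.mcar_statistical_tests import MCARTest
# p_value = MCARTest(method="little").little_mcar_test(numeric_games)
```

Using `select_dtypes` avoids listing the numeric columns by hand, which helps when the dataset has many features.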
@davzaman we may want to change the setting we use in pd.DataFrame.cov() from numeric_only = True to numeric_only = False. I am not sure whether this will cause problems later on when calculating the test statistics.
Dear @RianneSchouten , @davzaman ,
Thank you very much for such a detailed answer to my newbie question.
There is more:
I have created a subset of my data with only numeric columns and then applied the MCAR test. It behaved strangely: it returned a p-value equal to zero (I checked this with mt.little_mcar_test(data_mcar) == 0, which returned True).
Here's my code:
from pyampute.exploration.mcar_statistical_tests import MCARTest
data_mcar = games[['year_of_release', 'na_sales', 'eu_sales', 'jp_sales', 'other_sales', 'critic_score']]
mt = MCARTest(method="little")
print("Little's MCAR p-value:", f"{mt.little_mcar_test(data_mcar):.30f}")
As far as I know, a p-value cannot be exactly zero; it can only be close to zero.
Could you please help me, what am I doing wrong?
@balovp
This probably has to do with rounding; Python stores floating-point numbers with finite precision, so it cannot represent an endless number of decimals. The chi-square test statistic for your dataset is very large; the chi-square cumulative distribution function approaches 1, which is rounded to exactly 1 by Python, resulting in a p-value that is rounded to 0.
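The rounding effect can be reproduced with SciPy alone. This is a minimal sketch that assumes the p-value is obtained as 1 minus the chi-square CDF, as the explanation above suggests; the statistic and degrees of freedom are hypothetical values, not those of the actual dataset.

```python
from scipy.stats import chi2

# Hypothetical values: a large test statistic with few degrees of freedom.
statistic, dof = 200.0, 5

# The CDF is so close to 1 that it is stored as exactly 1.0 in double precision.
cdf = chi2.cdf(statistic, dof)
p_naive = 1.0 - cdf

# The survival function computes the upper tail directly and keeps the tiny value.
p_sf = chi2.sf(statistic, dof)

print("cdf == 1.0:", cdf == 1.0)
print("1 - cdf:", p_naive)
print("sf:", p_sf)
```

The true p-value here is positive but far smaller than the spacing between floating-point numbers near 1, so the subtraction yields exactly 0. For even larger statistics the survival function itself can underflow to 0, so a reported p-value of 0 is best read as "smaller than the machine can represent".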
Does this answer your question?
Hi @balovp following up here, do you have any questions remaining?