beyond_correlation
Added Mutual Information to available correlations (Issue #10)
I have added a function to calculate mutual information between columns of a DataFrame, and modified the discover function to enable its use.
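A minimal sketch of the kind of pairwise mutual information calculation being described, assuming a numeric DataFrame and scikit-learn's mutual_info_classif; the function name and shape are illustrative, not the PR's actual code, and integration with discover is not shown:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def mutual_information(df):
    """Estimate mutual information between every pair of columns.

    Illustrative sketch: each column in turn is treated as the target,
    with all remaining columns as features.
    """
    cols = df.columns
    result = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)
    for target in cols:
        features = df.drop(columns=[target])
        # mutual_info_classif assumes a discrete target; see the discussion below
        # about when the regression variant is needed instead.
        mi = mutual_info_classif(features.values, df[target].values)
        result.loc[features.columns, target] = mi
    return result
```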
Hi @PeteBleackley, many thanks for your interest. I'm at PyConUK this weekend; I did reply from my phone, but it doesn't seem to have been posted here.
You've used the classification MI function; I suspect we also need to include the regression equivalent: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression
What dataset did you test this on? I'm wondering what it means to use only the classification version. What's your experience with MI (I have none!)?
I've generally used MI on discrete data in the past, but I find information theory based algorithms interesting in general. I'll have to comment out line 22 to test - unfortunately, Ubuntu won't let me upgrade Python above 3.5, which is a bit daft. However, I'll have a go on Monday. According to the scikit-learn docs, we should use mutual_info_classif if the target column is discrete and mutual_info_regression if it's continuous. What would be the best way to test for that? Using the dtype of the column is the most obvious way, but probably not the most reliable.
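For illustration, a dtype-based heuristic along the lines suggested above might look like the following; this is only a sketch of the idea, not proposed code for the PR:

```python
from pandas.api import types
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mi_estimator_for(target_series):
    """Pick an MI estimator from the target column's dtype (rough heuristic)."""
    if types.is_float_dtype(target_series):
        # Floating-point columns are assumed continuous.
        return mutual_info_regression
    # Integers, booleans, categoricals and strings are treated as discrete.
    return mutual_info_classif
```

As noted, this is fragile: an integer column can still represent a continuous quantity, and a float column can hold a small set of coded categories.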
I've started doing some testing on the Boston dataset, and it looks like we need to use sklearn.utils.multiclass.type_of_target to determine whether to use mutual_info_classif or mutual_info_regression.
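A sketch of how that routing could work, assuming type_of_target's string return values; exact wiring into discover would differ:

```python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.utils.multiclass import type_of_target

def choose_mi_function(y):
    """Use type_of_target to decide which mutual information estimator applies."""
    target_type = type_of_target(y)
    if target_type in ('continuous', 'continuous-multioutput'):
        return mutual_info_regression
    # 'binary', 'multiclass', etc. are handled by the classification estimator.
    return mutual_info_classif
```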