beyond_correlation
Added Mutual Information to available correlations (Issue #10)
I have added a function to calculate mutual information between columns of a DataFrame, and modified the discover function to enable its use.
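A minimal sketch of the kind of pairwise mutual information calculation being described, assuming a numeric DataFrame and scikit-learn's mutual_info_classif; the function name and shape are illustrative, not the PR's actual code, and integration with discover is not shown:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def mutual_information(df):
    """Estimate mutual information between every pair of columns.

    Illustrative sketch: each column in turn is treated as the target,
    with all remaining columns as features.
    """
    cols = df.columns
    result = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)
    for target in cols:
        features = df.drop(columns=[target])
        # mutual_info_classif assumes a discrete target; see the discussion below
        # about when the regression variant is needed instead.
        mi = mutual_info_classif(features.values, df[target].values)
        result.loc[features.columns, target] = mi
    return result
```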
Hi @PeteBleackley, many thanks for your interest. I'm at PyConUK this weekend; I did reply from my phone, but it doesn't seem to have been posted here.
You've used the classification MI function; I suspect we also need to include the regression equivalent: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_regression.html#sklearn.feature_selection.mutual_info_regression
What dataset did you test this on? I'm wondering what it means to use only the classification version. What's your experience with MI (I have none!)?
I've generally used MI on discrete data in the past, but I find information theory based algorithms interesting in general. I'll have to comment out line 22 to test - unfortunately, Ubuntu won't let me upgrade Python above 3.5, which is a bit daft. However, I'll have a go on Monday. According to the scikit-learn docs, we should use mutual_info_classif if the target column is discrete and mutual_info_regression if it's continuous. What would be the best way to test for that? Using the dtype of the column is the most obvious way, but probably not the most reliable.
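For illustration, a dtype-based heuristic along the lines suggested above might look like the following; this is only a sketch of the idea, not proposed code for the PR:

```python
from pandas.api import types
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mi_estimator_for(target_series):
    """Pick an MI estimator from the target column's dtype (rough heuristic)."""
    if types.is_float_dtype(target_series):
        # Floating-point columns are assumed continuous.
        return mutual_info_regression
    # Integers, booleans, categoricals and strings are treated as discrete.
    return mutual_info_classif
```

As noted, this is fragile: an integer column can still represent a continuous quantity, and a float column can hold a small set of coded categories.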
I've started doing some testing on the Boston dataset, and it looks like we need to use sklearn.utils.multiclass.type_of_target to determine whether to use mutual_info_classif or mutual_info_regression.
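A sketch of how that routing could work, assuming type_of_target's string return values; exact wiring into discover would differ:

```python
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.utils.multiclass import type_of_target

def choose_mi_function(y):
    """Use type_of_target to decide which mutual information estimator applies."""
    target_type = type_of_target(y)
    if target_type in ('continuous', 'continuous-multioutput'):
        return mutual_info_regression
    # 'binary', 'multiclass', etc. are handled by the classification estimator.
    return mutual_info_classif
```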