ydata-profiling

Adding pairwise correlation table/plot to correlation section

Open kylegilde opened this issue 1 year ago • 4 comments

Missing functionality

I find staring at a correlation matrix or heatmap tedious. It contains the uninformative diagonal, and every value and variable pair appears twice!

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "num_col1": np.random.rand(10),
        "num_col2": np.random.rand(10),
        "num_col3": np.random.rand(10),
    }
)
df_corr_matrix = df.select_dtypes("number").corr()

# Tedious to stare at!
print(df_corr_matrix)
          num_col1  num_col2  num_col3
num_col1  1.000000 -0.290111  0.359558
num_col2 -0.290111  1.000000 -0.180386
num_col3  0.359558 -0.180386  1.000000

Instead, I like to melt the correlation matrix into a table sorted in descending order by the absolute value of the correlation metric. I remove the uninformative diagonal as well as the duplicated variable pairs by keeping only the lower triangle.

In this format, I can easily see which variables are highly correlated and whether they have a negative or positive relationship.

This makes the correlations much easier to interpret!

# True on the diagonal and upper triangle, so each pair is kept only once
nan_mask = np.triu(np.ones(df_corr_matrix.shape)).astype(bool)
df_pairwise_corr = (
    df_corr_matrix
    .mask(nan_mask)
    .rename_axis("variable_1")
    .reset_index()
    .melt("variable_1", var_name="variable_2")
    .dropna()
    .assign(abs_value=lambda df: df.value.abs())
    .sort_values("abs_value", ascending=False, ignore_index=True)
)

# Easy to interpret!
print(df_pairwise_corr)
  variable_1 variable_2     value  abs_value
0   num_col3   num_col1  0.359558   0.359558
1   num_col2   num_col1 -0.290111   0.290111
2   num_col3   num_col2 -0.180386   0.180386

Proposed feature

Let's add a "Pairwise Table" tab to the correlation section that shows the melted correlation matrix, sorted in descending order by the absolute value of the correlation metric.

The table will contain these columns: variable_1, variable_2, value, abs_value

See the above example.
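
As a rough sketch of how such a table could be produced for any DataFrame, the example above can be wrapped in a function. The name `pairwise_correlation_table` and the `method` parameter are assumptions made for illustration; they are not part of ydata-profiling. The sketch relies only on pandas' own `DataFrame.corr`, so it covers the metrics pandas supports ("pearson", "kendall", "spearman").

```python
import numpy as np
import pandas as pd


def pairwise_correlation_table(df: pd.DataFrame, method: str = "pearson") -> pd.DataFrame:
    """Hypothetical helper: melt the lower triangle of a correlation matrix.

    Returns the columns variable_1, variable_2, value, abs_value,
    sorted in descending order by abs_value.
    """
    corr = df.select_dtypes("number").corr(method=method)
    # True on the diagonal and upper triangle; those cells are blanked out.
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool))
    return (
        corr.mask(upper_mask)
        .rename_axis("variable_1")
        .reset_index()
        .melt("variable_1", var_name="variable_2")
        .dropna()  # note: this also drops pairs whose correlation is undefined (NaN)
        .assign(abs_value=lambda t: t["value"].abs())
        .sort_values("abs_value", ascending=False, ignore_index=True)
    )


# Usage with random data, mirroring the example above
rng = np.random.default_rng(0)
example = pd.DataFrame(
    rng.random((10, 3)), columns=["num_col1", "num_col2", "num_col3"]
)
pairwise = pairwise_correlation_table(example)
print(pairwise)
```

For n numeric columns the table has n * (n - 1) / 2 rows, one per unordered pair.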

Alternatives considered

No response

Additional context

No response

kylegilde, Jun 02 '23 18:06

@fabclmnt, what are your thoughts on this feature proposal? Do you think that it would enhance the interpretability of the correlation section?

kylegilde, Jun 13 '23 13:06

Or how about a vertical bar plot like this?

image
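
The attached image is not reproduced here. As one possible reading of the idea, a minimal matplotlib sketch could draw one vertical bar per variable pair, ordered by absolute correlation. The colors, labels, and axis limits below are assumptions, not taken from the original image.

```python
from itertools import combinations

import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 3)), columns=["num_col1", "num_col2", "num_col3"])
corr = df.corr()

# One entry per unordered pair of variables (no diagonal, no duplicates).
pairs = pd.Series(
    {f"{a} / {b}": corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
)
# Order the bars by absolute correlation, strongest first.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)

fig, ax = plt.subplots()
# Assumed color scheme: red for negative, blue for positive correlations.
colors = ["tab:red" if v < 0 else "tab:blue" for v in pairs]
ax.bar(pairs.index, pairs.values, color=colors)
ax.axhline(0, color="black", linewidth=0.8)
ax.set_ylim(-1, 1)
ax.set_ylabel("correlation")
ax.tick_params(axis="x", labelrotation=45)
fig.tight_layout()
```

Keeping the sign on the y-axis while sorting by magnitude shows both the strength and the direction of each relationship at a glance.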

kylegilde, Jun 26 '23 18:06

Hi @kylegilde,

I like the experience you propose, to be honest, but I would leave it as an option so that users can decide whether they prefer a heatmap or a pairwise type of visualization.

As requirements for development, I would add an option to the correlations section so that the user can select between the "pairwise" and "heatmap" configurations.

In pairwise mode, the user would get:

  • Plot: the proposed vertical bar plot
  • Table: the pairwise correlation table (presented in your suggestion)

What do you think?

fabclmnt, Jun 29 '23 04:06

That makes total sense. Thank you!

kylegilde, Jul 02 '23 22:07