ydata-profiling
Adding pairwise correlation table/plot to correlation section
Missing functionality
I find staring at a correlation matrix or heatmap tedious. It contains the uninformative diagonal, and every value and variable pair appears twice!
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {
        "num_col1": np.random.rand(10),
        "num_col2": np.random.rand(10),
        "num_col3": np.random.rand(10),
    }
)
df_corr_matrix = df.select_dtypes("number").corr()
# Tedious to stare at!
print(df_corr_matrix)
          num_col1  num_col2  num_col3
num_col1  1.000000 -0.290111  0.359558
num_col2 -0.290111  1.000000 -0.180386
num_col3  0.359558 -0.180386  1.000000
Instead, I like to melt the correlation matrix into a table sorted in descending order by the absolute value of the correlation. I remove the uninformative diagonal and the duplicated variable pairs by keeping only the lower triangle.
In this format, I can easily see which variables are highly correlated and whether the relationship is negative or positive.
The result is much easier to interpret!
nan_mask = np.triu(np.ones(df_corr_matrix.shape)).astype(bool)
df_pairwise_corr = (
    df_corr_matrix
    .mask(nan_mask)
    .rename_axis("variable_1")
    .reset_index()
    .melt("variable_1", var_name="variable_2")
    .dropna()
    .assign(abs_value=lambda df: df.value.abs())
    .sort_values("abs_value", ascending=False, ignore_index=True)
)
# Easy to interpret!
print(df_pairwise_corr)
  variable_1 variable_2     value  abs_value
0   num_col3   num_col1  0.359558   0.359558
1   num_col2   num_col1 -0.290111   0.290111
2   num_col3   num_col2 -0.180386   0.180386
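The melt steps above can be wrapped in a small reusable helper. This is only a sketch: the function name pairwise_correlations and its method parameter are assumptions made for illustration and are not part of ydata-profiling. It uses fixed example data so the result is reproducible.

```python
import numpy as np
import pandas as pd


def pairwise_correlations(df: pd.DataFrame, method: str = "pearson") -> pd.DataFrame:
    """Return one row per unordered pair of numeric columns, strongest first.

    Hypothetical helper: the name and signature are illustrative only.
    """
    corr = df.select_dtypes("number").corr(method=method)
    # True on the diagonal and above it. Masking those cells drops the
    # self-correlations and keeps each variable pair exactly once.
    upper_and_diag = np.triu(np.ones(corr.shape, dtype=bool))
    return (
        corr.mask(upper_and_diag)
        .rename_axis(index="variable_1")
        .reset_index()
        .melt(id_vars="variable_1", var_name="variable_2")
        # Note: this also drops pairs whose correlation is genuinely NaN
        # (for example, when one column is constant).
        .dropna(subset=["value"])
        .assign(abs_value=lambda d: d["value"].abs())
        .sort_values("abs_value", ascending=False, ignore_index=True)
    )


# Fixed example data; the non-numeric "label" column is ignored.
example = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [5, 4, 3, 2, 1],
        "c": [1, 3, 2, 5, 4],
        "label": list("vwxyz"),
    }
)
result = pairwise_correlations(example)
print(result)
```

Here "a" and "b" are perfectly anti-correlated, so that pair is listed first even though its correlation is negative.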
Proposed feature
Let's add a "Pairwise Table" tab to the correlation section that shows the melted correlation matrix, sorted in descending order by the absolute value of the correlation.
The table will contain these columns: variable_1, variable_2, value, abs_value
See the above example.
Alternatives considered
No response
Additional context
No response
@fabclmnt, what are your thoughts on this feature proposal? Do you think it would enhance the interpretability of the correlation section?
Or how about a vertical bar plot like this?
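The image attached to this comment is not reproduced here. As a rough idea of what such a plot could look like, here is a minimal sketch assuming matplotlib. The labels, colors, and axis limits are my own choices, and the data are the three values printed in the pairwise table earlier in this issue.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Hand-written pairwise table in the same shape as df_pairwise_corr,
# already sorted by absolute correlation.
pairs = pd.DataFrame(
    {
        "variable_1": ["num_col3", "num_col2", "num_col3"],
        "variable_2": ["num_col1", "num_col1", "num_col2"],
        "value": [0.359558, -0.290111, -0.180386],
    }
)
labels = pairs["variable_1"] + " / " + pairs["variable_2"]
# Color encodes the sign so positive and negative pairs are easy to tell apart.
colors = ["tab:blue" if v >= 0 else "tab:red" for v in pairs["value"]]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(labels, pairs["value"], color=colors)
ax.axhline(0, color="black", linewidth=0.8)  # zero reference line
ax.set_ylim(-1, 1)  # correlations are bounded by -1 and 1
ax.set_ylabel("correlation")
ax.set_title("Pairwise correlations, strongest first")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

Keeping the signed value on the y-axis while ordering bars by absolute value preserves both the strength ranking and the direction of each relationship.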
Hi @kylegilde,
I like the experience you propose, to be honest, but I would leave it as an option so the user can decide whether they prefer a heatmap or a pairwise type of visualization.
To put this as requirements for development: I would add an option to the correlations section so that the user can select between the configurations "pairwise" and "heatmap".
For the pairwise mode the user would get:
- Plot: the proposed vertical barplot
- Table: Pairwise correlation tables (presented in your suggestion)
What do you think?
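To make the requirement concrete, the selection between the two modes could be sketched roughly as below. Everything here is hypothetical: correlation_table, the view parameter, and the CorrelationView type are illustrative names, not existing ydata-profiling settings or APIs. Only the table side of each mode is shown.

```python
from typing import Literal

import numpy as np
import pandas as pd

# Hypothetical option values taken from the wording of the requirement.
CorrelationView = Literal["heatmap", "pairwise"]


def correlation_table(corr: pd.DataFrame, view: CorrelationView = "heatmap") -> pd.DataFrame:
    """Return the table backing the chosen view (hypothetical helper)."""
    if view == "heatmap":
        # The heatmap mode works from the full square matrix.
        return corr
    if view == "pairwise":
        # Strict lower triangle: one entry per pair, no diagonal.
        rows, cols = np.tril_indices(len(corr), k=-1)
        table = pd.DataFrame(
            {
                "variable_1": corr.index[rows],
                "variable_2": corr.columns[cols],
                "value": corr.to_numpy()[rows, cols],
            }
        )
        table["abs_value"] = table["value"].abs()
        return table.sort_values("abs_value", ascending=False, ignore_index=True)
    raise ValueError(f"unknown correlation view: {view!r}")


names = ["num_col1", "num_col2", "num_col3"]
corr = pd.DataFrame(
    [[1.0, -0.29, 0.36], [-0.29, 1.0, -0.18], [0.36, -0.18, 1.0]],
    index=names,
    columns=names,
)
pairwise = correlation_table(corr, view="pairwise")
print(pairwise)
```

Defaulting to "heatmap" in this sketch would keep existing reports unchanged unless the user opts in to the pairwise mode.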
That makes total sense. Thank you!