dython
Adding option to drop samples in each pair of columns independently
Implements the feature proposed in #130.
In short: when NaN values in a third column interfere with the calculation of the correlation between two columns, dropping NaN samples for each pair of columns independently gives more insight into the relationship between their values.
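To make the idea concrete, here is a minimal sketch using plain pandas rather than dython's internals. The helper name `pairwise_spearman` is hypothetical and is not part of dython; it only illustrates dropping NaN rows per column pair instead of across the whole frame. Spearman is computed as the Pearson correlation of the ranks.

```python
import pandas as pd


def pairwise_spearman(df: pd.DataFrame, a: str, b: str) -> float:
    """Spearman correlation of columns a and b (hypothetical helper).

    Only rows where a or b is NaN are dropped; NaNs in any other
    column are ignored.
    """
    pair = df[[a, b]].dropna()
    # Spearman = Pearson correlation of the ranks.
    return pair[a].rank().corr(pair[b].rank())


nan = float("nan")
df = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, nan, 5.0, 6.0],
        "C": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
        "X": [nan, nan, nan, 1.0, 2.0, 6.0],
    }
)

# Dropping whole samples: a NaN in any column removes the row.
listwise_rows = len(df.dropna())
# Dropping per pair: for (A, C) the NaNs in X do not matter.
pairwise_rows = len(df[["A", "C"]].dropna())
rho_ac = pairwise_spearman(df, "A", "C")

print(listwise_rows, pairwise_rows, rho_ac)
```

In this small frame, column X costs the (A, C) pair three of its five usable rows when whole samples are dropped, even though X plays no role in that correlation.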
Output with this change:
NaN strategy : drop_samples
corr_df:
         A (con)  B (con)  C (con)  D (con)  E (nom)  F (nom)  X (con)
A (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
B (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
C (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
D (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
E (nom)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
F (nom)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
X (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
NaN strategy : replace
corr_df:
           A (con)    B (con)    C (con)    D (con)    E (nom)    F (nom)    X (con)
A (con)   1.000000   0.085714  -0.657143  -0.657143   0.952277   0.691564   0.576818
B (con)   0.085714   1.000000  -0.428571  -0.428571   0.377964   0.632456   0.333947
C (con)  -0.657143  -0.428571   1.000000   1.000000   0.925820   0.447214  -0.941124
D (con)  -0.657143  -0.428571   1.000000   1.000000   0.925820   0.447214  -0.941124
E (nom)   0.952277   0.377964   0.925820   0.925820   1.000000   0.685331   0.842075
F (nom)   0.691564   0.632456   0.447214   0.447214   0.557886   1.000000   0.356753
X (con)   0.576818   0.333947  -0.941124  -0.941124   0.842075   0.356753   1.000000
NaN strategy : drop_missing_sample_pairs
corr_df:
           A (con)    B (con)    C (con)    D (con)    E (nom)    F (nom)  X (con)
A (con)   1.000000   1.000000  -1.000000  -1.000000   0.924473   0.420084      1.0
B (con)   1.000000   1.000000  -1.000000  -1.000000   0.925820   0.225494      1.0
C (con)  -1.000000  -1.000000   1.000000   1.000000   0.924473   0.431331     -1.0
D (con)  -1.000000  -1.000000   1.000000   1.000000   0.924473   0.431331     -1.0
E (nom)   0.924473   0.925820   0.924473   0.924473   1.000000   0.311278      0.0
F (nom)   0.420084   0.225494   0.431331   0.431331   0.383689   1.000000      1.0
X (con)   1.000000   1.000000  -1.000000  -1.000000   0.000000   1.000000      1.0
Script used to produce this output:
import dython
import pandas as pd

nan = float("nan")
# na = pd.NA
na = float("nan")

# Every row except the last has a NaN in at least one column.
df = pd.DataFrame(
    {
        "A": [1., 2., 3., na, 5., 6.],
        "B": [1., 2., 3., 4., nan, 6.],
        "C": [9., 8., 7., 6., 5., 4.],
        "D": [9., 8., 7., 6., 5., 4.],
        "E": ["a", "a", "a", na, "b", "b"],
        "F": ["c", "c", na, "d", "e", "c"],
        "X": [na, na, na, 1., 2., 6.],
    }
)
##
from dython.nominal import associations
import traceback
##
for nan_strategy in ["drop_samples", "replace", "drop_missing_sample_pairs"]:
    try:
        if nan_strategy == "drop_missing_sample_pairs":
            # Fall back to "replace" when the installed dython
            # does not provide the new option yet.
            try:
                dython.nominal._DROP_MISSING_PAIRS
            except AttributeError:
                nan_strategy = "replace"
        print()
        print("NaN strategy : ", nan_strategy)
        d = associations(
            df,
            plot=False,
            compute_only=True,
            nan_strategy=nan_strategy,
            nom_nom_assoc="theil",
            num_num_assoc="spearman",
            mark_columns=True,
        )
        corr_df = d["corr"]
        print("corr_df:")
        print(corr_df)
    except Exception as e:
        print("no success with nan_strategy ", nan_strategy)
        print(e)
        print(traceback.format_exc())
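As a rough sanity check on the output, the sketch below counts the usable rows in the same example frame with plain pandas (it does not call dython). Only one row is complete across all columns, which would plausibly explain the degenerate all-zero matrix under drop_samples; that reading is an assumption, not something confirmed in this thread. The name `pair_counts` is introduced here for illustration only.

```python
import pandas as pd

na = float("nan")
df = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, na, 5.0, 6.0],
        "B": [1.0, 2.0, 3.0, 4.0, na, 6.0],
        "C": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
        "D": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
        "E": ["a", "a", "a", na, "b", "b"],
        "F": ["c", "c", na, "d", "e", "c"],
        "X": [na, na, na, 1.0, 2.0, 6.0],
    }
)

# Rows that survive when a NaN in any column drops the whole sample.
complete_rows = len(df.dropna())

# Rows that survive for each pair when only that pair's NaNs count.
cols = list(df.columns)
pair_counts = pd.DataFrame(
    [[int((df[a].notna() & df[b].notna()).sum()) for b in cols] for a in cols],
    index=cols,
    columns=cols,
)

print("complete rows:", complete_rows)
print(pair_counts)
```

The per-pair counts also show that some pairs, such as (A, X), are left with only two samples, so a rank correlation of exactly 1.0 or -1.0 for those pairs should be read with caution.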
I see what you did here, nice idea. I got COVID a few days ago, so I'm a little less focused right now; I'll take a deeper look once I'm better. Thanks for this :)
test_datetime_data fails, and I don't understand why.
Strangely enough, the tests pass for me (@9578dc18cd0debd106d844bc2f13d415499e3e3b), though with some warnings. I used Python 3.8 from Anaconda for the test.
Hey @matbb, I just merged a PR that enforces using Black. This branch needs to merge master, as there are conflicts.
This PR needs a complete refactor because of the new parallelization mechanism that was added. I tried to do this, but broke the whole thing.
Fixed. We still need to add a test for the new option that was added here.