dython Adding option to drop samples in each pair of columns independently

Open matbb opened this issue 1 year ago • 4 comments

Adding option to drop samples in each pair of columns independently.

Implements feature proposed in #130.

In short: when NaN values in a third column interfere with the calculation of the correlation between two columns, dropping NaN samples for each pair of columns independently gives more insight into the relationship between those two columns.
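To make the difference concrete, here is a minimal sketch in plain Python (no dython or pandas) of how many samples survive under each deletion scheme. It reuses three numeric columns from the script in this PR. The helper `complete_rows` is a hypothetical name used only for illustration and is not part of dython.

```python
import math

nan = float("nan")

# Three numeric columns copied from the script in this PR.
columns = {
    "A": [1.0, 2.0, 3.0, nan, 5.0, 6.0],
    "B": [1.0, 2.0, 3.0, 4.0, nan, 6.0],
    "X": [nan, nan, nan, 1.0, 2.0, 6.0],
}


def complete_rows(data, names):
    """Return indices of rows with no NaN in any of the named columns.

    Hypothetical helper for illustration; not a dython function.
    """
    n_rows = len(next(iter(data.values())))
    return [
        i
        for i in range(n_rows)
        if not any(math.isnan(data[name][i]) for name in names)
    ]


# Listwise deletion: a NaN in any column removes the row for every pair.
listwise = complete_rows(columns, ["A", "B", "X"])

# Pairwise deletion: only NaNs in the two columns being compared matter.
pair_ab = complete_rows(columns, ["A", "B"])
pair_ax = complete_rows(columns, ["A", "X"])

print("listwise rows:", listwise)
print("A-B rows:", pair_ab)
print("A-X rows:", pair_ax)
```

With listwise deletion only one row is left for every pair, so no meaningful correlation can be computed. With pairwise deletion the A-B pair keeps four rows, because the NaNs in X no longer affect it.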

Change result:

NaN strategy :  drop_samples
corr_df:
         A (con)  B (con)  C (con)  D (con)  E (nom)  F (nom)  X (con)
A (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
B (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
C (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
D (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
E (nom)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
F (nom)      0.0      0.0      0.0      0.0      0.0      0.0      0.0
X (con)      0.0      0.0      0.0      0.0      0.0      0.0      0.0

NaN strategy :  replace
corr_df:
          A (con)   B (con)   C (con)   D (con)   E (nom)   F (nom)   X (con)
A (con)  1.000000  0.085714 -0.657143 -0.657143  0.952277  0.691564  0.576818
B (con)  0.085714  1.000000 -0.428571 -0.428571  0.377964  0.632456  0.333947
C (con) -0.657143 -0.428571  1.000000  1.000000  0.925820  0.447214 -0.941124
D (con) -0.657143 -0.428571  1.000000  1.000000  0.925820  0.447214 -0.941124
E (nom)  0.952277  0.377964  0.925820  0.925820  1.000000  0.685331  0.842075
F (nom)  0.691564  0.632456  0.447214  0.447214  0.557886  1.000000  0.356753
X (con)  0.576818  0.333947 -0.941124 -0.941124  0.842075  0.356753  1.000000

NaN strategy :  drop_missing_sample_pairs
corr_df:
          A (con)   B (con)   C (con)   D (con)   E (nom)   F (nom)  X (con)
A (con)  1.000000  1.000000 -1.000000 -1.000000  0.924473  0.420084      1.0
B (con)  1.000000  1.000000 -1.000000 -1.000000  0.925820  0.225494      1.0
C (con) -1.000000 -1.000000  1.000000  1.000000  0.924473  0.431331     -1.0
D (con) -1.000000 -1.000000  1.000000  1.000000  0.924473  0.431331     -1.0
E (nom)  0.924473  0.925820  0.924473  0.924473  1.000000  0.311278      0.0
F (nom)  0.420084  0.225494  0.431331  0.431331  0.383689  1.000000      1.0
X (con)  1.000000  1.000000 -1.000000 -1.000000  0.000000  1.000000      1.0

Output produced by the following script:

import traceback

import pandas as pd

import dython
from dython.nominal import associations

# A float NaN marks missing values in every column
# (pd.NA would be an alternative; it is not used here).
na = float("nan")

df = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, na, 5.0, 6.0],
        "B": [1.0, 2.0, 3.0, 4.0, na, 6.0],
        "C": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
        "D": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
        "E": ["a", "a", "a", na, "b", "b"],
        "F": ["c", "c", na, "d", "e", "c"],
        "X": [na, na, na, 1.0, 2.0, 6.0],
    }
)

for nan_strategy in ["drop_samples", "replace", "drop_missing_sample_pairs"]:
    try:
        # Fall back to "replace" when the installed dython lacks the new option.
        if nan_strategy == "drop_missing_sample_pairs" and not hasattr(
            dython.nominal, "_DROP_MISSING_PAIRS"
        ):
            nan_strategy = "replace"

        print()
        print("NaN strategy : ", nan_strategy)
        d = associations(
            df,
            plot=False,
            compute_only=True,
            nan_strategy=nan_strategy,
            nom_nom_assoc="theil",
            num_num_assoc="spearman",
            mark_columns=True,
        )
        corr_df = d["corr"]

        print("corr_df:")
        print(corr_df)

    except Exception as e:
        # Report the failure and continue with the next strategy.
        print("no success with nan_strategy ", nan_strategy)
        print(e)
        print(traceback.format_exc())

Aug 16 '22 13:08 matbb

I see what you did here, nice idea. I got COVID a few days ago, so I'm a little less focused now, I'll take a deeper look once I'm better. Thanks for this :)

Aug 17 '22 15:08 shakedzy

test_datetime_data fails, and I don't understand why.

Aug 27 '22 12:08 shakedzy

Strangely enough, the tests pass for me (@9578dc18cd0debd106d844bc2f13d415499e3e3b), though with some warnings. I used Python 3.8 from Anaconda for the test.

test-output.txt

Aug 27 '22 13:08 matbb

Hey @matbb, I just merged a PR that enforces using Black. This branch needs to merge master, as there are conflicts.

Sep 03 '22 19:09 shakedzy

So this PR needs a whole new refactor due to the new parallelizing mechanism that was added. I tried to do this, but broke the whole thing.

Oct 21 '22 20:10 shakedzy

Fixed. We still need to add a test for the new option that was added here.

Oct 21 '22 21:10 shakedzy
