Please clarify how mixed data are to be represented in causal-learn.

Open jdramsey opened this issue 2 years ago • 7 comments

This is a request for information. For the mixed data project, could you clarify how mixed data are to be represented in causal-learn? That is, how is one to know, programmatically, which columns are for discrete data and which for continuous data? This cannot be gleaned from an np array itself, since binary data, for instance, can be treated as either continuous (with values 0 and 1) or discrete, and ordinal discrete data may often be treated as either continuous or discrete as well.

jdramsey avatar Jun 28 '22 17:06 jdramsey

Hi Joe,

Thank you for your question! There is currently no specific representation for mixed data in causal-learn. One solution might be to introduce a new variable that indicates which columns are discrete and which are continuous.
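
A minimal sketch of that idea, assuming a hypothetical boolean mask `is_discrete` kept alongside the array. The mask, the variable names, and the sample values are illustrative only and are not part of any existing causal-learn API:

```python
import numpy as np

# Hypothetical example data: column 0 is continuous, column 1 is binary,
# column 2 is an ordinal code. All three are stored as floats in one array,
# so the array alone cannot say which columns are discrete.
data = np.array([
    [0.5, 0.0, 2.0],
    [1.7, 1.0, 0.0],
    [2.3, 1.0, 1.0],
])

# Hypothetical side channel: True marks a column to be treated as discrete.
is_discrete = np.array([False, True, True])

# Column indices for each kind of variable.
discrete_idx = np.flatnonzero(is_discrete)
continuous_idx = np.flatnonzero(~is_discrete)

# Split the array so each part can be handed to a suitable test or score.
discrete_part = data[:, discrete_idx].astype(int)
continuous_part = data[:, continuous_idx]
```

Keeping the mask separate from the array leaves the choice for binary and ordinal columns with the user, which is the ambiguity raised in the question.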

chenweiDelight avatar Jun 29 '22 14:06 chenweiDelight

Thanks Joe for raising this and thanks Wei for the answer! In the near future, we would also like to accept pandas.DataFrame as input, which naturally supports different data types. We would also like to make the graph representation consistent with networkx, which would fit better with other Python packages used for downstream tasks.

kunwuz avatar Jun 29 '22 14:06 kunwuz

Hi @kunwuz @chenweiDelight, Bryan is a little short on time these days, but my Python is getting better, so maybe I can help get the DG score into causal-learn. But maybe after the pandas.DataFrame refactoring is done, to minimize duplication of effort?

jdramsey avatar Jul 02 '22 17:07 jdramsey

Thanks so much, Joe! We will first work on the pandas.DataFrame refactoring to represent mixed data better. We are very grateful for your willingness to help. In case you have multiple commitments and a heavy workload, perhaps we could first try to produce a preliminary 'causal-learn-adapted' version of DG, based on some of your or Bryan's existing implementations, with your guidance and review?

kunwuz avatar Jul 02 '22 17:07 kunwuz

Sure! I'll get the code and send it along.

jdramsey avatar Jul 02 '22 18:07 jdramsey

I think one potential way to differentiate them is by using different dtypes (integer type for discrete variables and float type for continuous variables), as in https://github.com/phlippe/ENCO
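
A minimal sketch of this dtype convention in pandas, assuming float columns are continuous and everything else is discrete. The helper `split_by_dtype` and the sample frame are hypothetical and are not taken from ENCO or causal-learn:

```python
import pandas as pd
from pandas.api.types import is_float_dtype


def split_by_dtype(df):
    """Return (continuous, discrete) column-name lists based on dtype.

    Assumption for this sketch: float columns are continuous; every other
    dtype (integer, object, category, ...) is treated as discrete.
    """
    continuous = [c for c in df.columns if is_float_dtype(df[c])]
    discrete = [c for c in df.columns if c not in continuous]
    return continuous, discrete


# Hypothetical frame: one float column, one integer column, one string column.
df = pd.DataFrame({"x": [0.1, 0.2, 0.3], "b": [0, 1, 1], "g": ["a", "b", "a"]})
continuous, discrete = split_by_dtype(df)
```

Under this convention, a binary column stored as integers is discrete, and casting it to float is how a user would opt in to treating it as continuous.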

tonyabracadabra avatar Oct 04 '22 08:10 tonyabracadabra

I agree. As a recent development, for another project (https://github.com/cmu-phil/py-tetrad) we've also been using the dtypes of pandas data frames to distinguish continuous from discrete columns.

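# Assumed imports: the Java packages are imported through JPype, which
# requires a running JVM with the Tetrad jar on the classpath.
import jpype.imports
from pandas import DataFrame
import java.util as util
import edu.cmu.tetrad.data as td

def pandas_to_tetrad(df: DataFrame, int_as_cont=False):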
    dtypes = ["float16", "float32", "float64"]
    if int_as_cont:
        for i in range(3, 7):
            dtypes.append(f"int{2**i}")
            dtypes.append(f"uint{2**i}")
    cols = df.columns
    discrete_cols = [col for col in cols if df[col].dtypes not in dtypes]
    category_map = {col: {val: i for i, val in enumerate(df[col].unique())} for col in discrete_cols}
    df = df.replace(category_map)
    values = df.values
    n, p = df.shape

    variables = util.ArrayList()
    for col in cols:
        if col in discrete_cols:
            categories = util.ArrayList()
            for category in category_map[col]:
                categories.add(str(category))
            variables.add(td.DiscreteVariable(str(col), categories))
        else:
            variables.add(td.ContinuousVariable(str(col)))

    if len(discrete_cols) == len(cols):
        databox = td.IntDataBox(n, p)
    elif len(discrete_cols) == 0:
        databox = td.DoubleDataBox(n, p)
    else:
        databox = td.MixedDataBox(variables, n)

    for col, var in enumerate(values.T):
        for row, val in enumerate(var):
            databox.set(row, col, val)

    return td.BoxDataSet(databox, variables)

We're able to load datasets using read_csv in Python the usual way, set their dtypes where necessary, and use the above method to distinguish continuous from discrete columns correctly, e.g.:

df = pd.read_csv("resources/auto-mpg.data.mixed.max.3.categories.txt", sep="\t")
df = df.astype({col: "float64" for col in df.columns if col != "origin"})

This is the dataset in question:

https://raw.githubusercontent.com/cmu-phil/example-causal-datasets/main/real/auto-mpg/data/auto-mpg.data.mixed.max.3.categories.txt

You can see it's necessary to set the dtypes here because some of the continuous columns contain only integer-looking values and would otherwise be read with an integer dtype. Only 'origin' is a discrete variable.
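
A small sketch of why the cast is needed, using a made-up two-row, tab-separated sample in place of the real file. The column names other than 'origin' and all the values are assumptions for illustration:

```python
import io

import pandas as pd

# Hypothetical sample; these rows are invented and are not from the dataset.
raw = "mpg\tcylinders\tweight\torigin\n18.0\t8\t3504\t1\n24.0\t4\t2372\t3\n"

df = pd.read_csv(io.StringIO(raw), sep="\t")

# Columns holding only whole numbers are inferred as integers by read_csv.
inferred = {col: str(dtype) for col, dtype in df.dtypes.items()}

# Cast everything except the discrete 'origin' column to float64.
df = df.astype({col: "float64" for col in df.columns if col != "origin"})
fixed = {col: str(dtype) for col, dtype in df.dtypes.items()}
```

After the cast, only 'origin' keeps an integer dtype, so a dtype-based rule would classify it alone as discrete.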

jdramsey avatar Mar 17 '23 14:03 jdramsey