causal-learn: The performance gap between `KCI_UInd` and `KCI_CInd` under a similar setting

This issue is based on the code in pull request #55.

I ran into an odd performance gap between `KCI_UInd` and `KCI_CInd`. Intuitively, testing $X\bot Y$ and testing $X\bot Y|Z=1$ (where Z is a constant) should perform similarly, or the latter test (which uses `KCI_CInd`) should perform somewhat worse because it handles a more general case. However, when I ran the code, the result was not what I expected.
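The intuition can be checked on the kernel matrices themselves. The sketch below is not causal-learn's code: it assumes the textbook KCI construction, in which the conditioning set enters through $R_Z = \epsilon(\tilde K_Z + \epsilon I)^{-1}$, and the helper names (`gaussian_gram`, `center`), the kernel width, and the value of `epsilon` are all chosen for illustration. When Z is constant, its Gram matrix is all ones, the centered matrix is zero, and $R_Z$ reduces to the identity, so the residualized kernel equals the unconditional one.

```python
import numpy as np


def gaussian_gram(data, width):
    """Gaussian (RBF) Gram matrix for the rows of `data`."""
    sq_norms = np.sum(data ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * data @ data.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * width ** 2))


def center(gram):
    """Double-center a Gram matrix: H K H with H = I - 11^T / n."""
    n = gram.shape[0]
    h = np.eye(n) - np.full((n, n), 1.0 / n)
    return h @ gram @ h


rng = np.random.default_rng(0)
n = 50
x = rng.random((n, 2))
z_const = np.ones((n, 1))   # conditioning "variable" that never changes
epsilon = 1e-3              # assumed regularization value

kx = center(gaussian_gram(x, width=1.0))
kz = center(gaussian_gram(z_const, width=1.0))        # all-ones Gram, centered to zero
r_z = epsilon * np.linalg.inv(kz + epsilon * np.eye(n))  # becomes the identity
kx_given_z = r_z @ kx @ r_z                            # residualized kernel

print(np.abs(kz).max(), np.abs(kx_given_z - kx).max())
```

Under these assumptions, any gap between the two tests on a constant Z would have to come from implementation details such as kernel width selection or the null approximation, not from the conditioning step itself.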

I tested the code on a random collider dataset, in which $X\bot Z$ holds while X and Y are dependent ($X\not\bot Y$); I also print the test statistic, mean, and variance for easier debugging. The results show similar p-values for $X\bot Z$ and $X\bot Y$, but different p-values for $X\bot Z | 1$ and $X\bot Y | 1$.

The following is my test code:

from icecream import ic
from causallearn.utils.cit import CIT
from tqdm import trange
import numpy as np


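def generate_single_sample(kind, dim):
    """Draw one sample and return it as a flat list of 3 * dim + 1 values."""
    if kind == 'chain':
        # X -> Y -> Z
        X = np.random.random(dim)
        Y = np.random.random(dim) + X
        Z = np.random.random(dim) + Y
    elif kind == 'collider':
        # X -> Y <- Z
        X = np.random.random(dim)
        Z = np.random.random(dim)
        Y = np.random.random(dim) + X + Z
    else:
        raise ValueError(f"unknown sample type: {kind!r}")
    # Y = np.zeros(dim) + np.average(Y)
    # Column layout for dim=10 (31 columns): X: 0..9; Y: 10..19; Z: 20..29; constant 1: 30
    return list(X) + list(Y) + list(Z) + [1]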

def generate_dataset(dim, size):
    dataset = []
    for i in range(size):
        datapoint = generate_single_sample('collider', dim)
        dataset.append(datapoint)
    dataset = np.array(dataset)
    return dataset


if __name__ == '__main__':
    dataset = generate_dataset(10, 1000)
    cit_tester = CIT(dataset, method = 'kci')
    # ic(cit_tester.kci(0, 20, []))
    # The original version cannot pass this because feature 30 has the same value in every sample
    # ic(cit_tester.kci(0, 20, [30]))
    # The following comes from one of my recent requirements: using CIT to test high-dimensional variables.
    # Testing high-dimensional variables is not supported by the current CIT class, contrary to the documentation,
    # so I also implemented this in the last commit.
    # I will open a separate issue about CIT on high-dimensional variables later.
    ic(cit_tester.kci(range(10), range(20,30), range(10,20)))
    ic(cit_tester.kci(range(10), range(20,30), []))
    ic(cit_tester.kci(range(10), range(10,20), []))
    ic(cit_tester.kci(range(10), range(20,30), [30]))
    ic(cit_tester.kci(range(10), range(10,20), [30]))

Jul 11 '22 13:07 cogito233

Thanks for reporting!

This is very helpful!

cc @MarkDana @aoqiz to take a look. We are also aware that KCI may have problems (we are currently adding tests for KCI).

You can refer to this PR and check the discussions.

https://github.com/cmu-phil/causal-learn/pull/51

Jul 11 '22 13:07 tofuwen

@cogito233 Thanks for reporting this!

The results for X;Z are reasonable, on both UInd and CInd with cond=const.

However, the results for X;Y on UInd are weird (though reasonable on CInd with cond=const): X and Y should be dependent.

This seems to be an issue with UInd over multivariate unconditional variables. If we vary dim and run UInd:

===== dim=1 =====
X;Z|Y (0.0, 9066.205401084622)
X;Z (0.29556718270406024, 539.1317250445156)
X;Y (0.0, 16264.874623118912)
===== dim=2 =====
X;Z|Y (0.0, 4557.215025272839)
X;Z (0.7249326967910759, 867.7338218304967)
X;Y (0.0, 2110.464456089926)
===== dim=3 =====
X;Z|Y (0.0, 3351.2202498199854)
X;Z (0.1513190541422521, 988.4356054402823)
X;Y (1.49993761855427e-09, 1014.1590135809473)
===== dim=4 =====
X;Z|Y (0.0, 1794.3057275057783)
X;Z (0.2556551447210166, 997.4650516234104)
X;Y (0.2803597197168265, 997.3863490478714)
===== dim=5 =====
X;Z|Y (0.0, 899.4903065524257)
X;Z (0.24524614468945394, 998.814910086335)
X;Y (0.24312389282574143, 998.8338678400082)
===== dim=6 =====
X;Z|Y (0.0, 406.97539171251213)
X;Z (0.2398808896954846, 998.9853103086572)
X;Y (0.23995980736465694, 998.9829826156974)
===== dim=7 =====
X;Z|Y (0.0, 180.6456809446278)
X;Z (0.23967820296940512, 998.9992119205756)
X;Y (0.23971332440305304, 998.9980748157069)
===== dim=8 =====
X;Z|Y (0.0, 70.9566245430464)
X;Z (0.23967704831912018, 998.9998747995652)
X;Y (0.23967691930623714, 998.9999277970797)
===== dim=9 =====
X;Z|Y (0.0, 30.999050951169558)
X;Z (0.23967673147454305, 998.9999995812599)
X;Y (0.23967673147172885, 998.9999998991132)
===== dim=10 =====
X;Z|Y (0.0, 13.118072125674262)
X;Z (0.2396767314715128, 998.9999999994757)
X;Y (0.2396767314713849, 998.9999999992579)

We see reasonable results (a zero or near-zero p-value for X;Y) in the low-dimensional cases (dim=1,2,3). Already at dim=4 the X;Y p-value rises to 0.28, and for dim≥5 the results are unreasonable: X;Z and X;Y are almost the same. Weird.
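One reading of these numbers, offered as a hypothesis and not as a confirmed diagnosis: for larger dim the second value in each tuple (apparently the test statistic) approaches 999, which is n − 1 for the n = 1000 samples in the test script. That is exactly the value trace(H·H) = n − 1 obtained when both Gram matrices collapse to the identity, which happens to a Gaussian kernel whose width is small relative to typical pairwise distances in higher dimensions. The sketch below reproduces the effect with plain NumPy; the fixed `width=0.5`, the z-scoring, and all helper names are assumptions for illustration, not causal-learn's actual settings.

```python
import numpy as np


def zscore(data):
    """Standardize each column to zero mean and unit variance."""
    return (data - data.mean(axis=0)) / data.std(axis=0)


def centered_gram(data, width):
    """Centered Gaussian Gram matrix H K H of the z-scored rows of `data`."""
    data = zscore(data)
    sq_norms = np.sum(data ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * data @ data.T
    gram = np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * width ** 2))
    n = gram.shape[0]
    h = np.eye(n) - np.full((n, n), 1.0 / n)
    return h @ gram @ h


def trace_statistic(a, b, width=0.5):
    """trace(Ka Kb) for centered Gram matrices (elementwise sum, since both are symmetric)."""
    return float(np.sum(centered_gram(a, width) * centered_gram(b, width)))


def collider(n, dim, rng):
    """X -> Y <- Z with uniform noise, mirroring the generator in the test script."""
    x = rng.random((n, dim))
    z = rng.random((n, dim))
    y = rng.random((n, dim)) + x + z
    return x, y, z


rng = np.random.default_rng(0)
n = 400
results = {}
for dim in (1, 10):
    x, y, z = collider(n, dim, rng)
    results[dim] = (trace_statistic(x, z), trace_statistic(x, y))
    print(f"dim={dim}: X;Z stat={results[dim][0]:.3f}, X;Y stat={results[dim][1]:.3f}, n-1={n - 1}")
```

With this fixed width, the dim=1 statistic for X;Y is clearly larger than for X;Z, while at dim=10 both sit next to n − 1, so the statistic no longer carries information about dependence. Whether the same mechanism is at work inside `KCI_UInd` would need to be checked against the kernel width it actually uses.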

We'll have a look at this. Thanks again :))

Jul 11 '22 13:07 MarkDana

Hi @cogito233,

I checked with an expert. Our KCI code is translated from the original MATLAB code by Kun, the first author of KCI (http://people.tuebingen.mpg.de/kzhang/KCI-test.zip). Would you mind running your example on the MATLAB version of KCI to check whether it works there?

Jul 12 '22 03:07 tofuwen

Hey @aoqiz, could you take a look at this issue, since you fixed a KCI bug?

I guess it's related?

Jul 21 '22 06:07 tofuwen

Hi @tofuwen @MarkDana @cogito233, sorry for the late reply. It seems this issue is not related to the typos we fixed.

As @MarkDana said,

the results for X;Z are reasonable, on both UInd and CInd with cond=const.

However, the results for X;Y on both UInd and CInd with cond=const are weird: X and Y should be dependent.

I also agree this seems to be an issue with UInd and CInd with cond=const over multivariate unconditional variables. If we vary dim and run the tests with the code after our typo fixes:

===== dim 1 ======
X;Z|Y (0.0)
X;Z (0.7395068500337783)
X;Y (0.0)
X;Z|1 (0.7162780815205464)
X;Y|1 (0.0)
===== dim 2 ======
X;Z|Y (0.0)
X;Z (0.2465625725509405)
X;Y (0.0)
X;Z|1 (0.097182540703873)
X;Y|1 (0.0)
===== dim 3 ======
X;Z|Y (0.0)
X;Z (0.52430720476651)
X;Y (2.333289117473214e-11)
X;Z|1 (0.29715398301005747)
X;Y|1 (0.0)
===== dim 4 ======
X;Z|Y (0.0)
X;Z (0.27263594492237764)
X;Y (0.2198998343635813)
X;Z|1 (0.49178236483423254)
X;Y|1 (0.0)
===== dim 5 ======
X;Z|Y (0.0)
X;Z (0.2443953147587915)
X;Y (0.24305515590790838)
X;Z|1 (0.6007947756316929)
X;Y|1 (0.0005542517721440765)
===== dim 6 ======
X;Z|Y (0.0)
X;Z (0.23888770585386843)
X;Y (0.23984225333290754)
X;Z|1 (0.4686281109563877)
X;Y|1 (0.1917556480375482)
===== dim 7 ======
X;Z|Y (0.0)
X;Z (0.23967730291953726)
X;Y (0.23968207103708283)
X;Z|1 (0.48387886361881394)
X;Y|1 (0.4153827973057993)
===== dim 8 ======
X;Z|Y (2.233449802879761e-08)
X;Z (0.23967673274989199)
X;Y (0.23968035050484848)
X;Z|1 (0.4818506425690403)
X;Y|1 (0.46968345261009004)
===== dim 9 ======
X;Z|Y (2.7946067159279053e-06)
X;Z (0.23967673156516878)
X;Y (0.23967673150322388)
X;Z|1 (0.48489443446559577)
X;Y|1 (0.48146647652485997)
===== dim 10 ======
X;Z|Y (0.003043305128150209)
X;Z (0.239676731471257)
X;Y (0.23967673147182922)
X;Z|1 (0.4849895662924787)
X;Y|1 (0.48442408716692864)

We still see reasonable results (p-values of almost zero for X;Y) in the low-dimensional cases: UInd for dim=1,2,3 and CInd with cond=const for dim=1,2,3,4,5. But for larger dim, the results are unreasonable: X;Z and X;Y are almost the same.

We'll keep looking into this. Thanks again :))
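One diagnostic that may help narrow this down, offered as a sketch and not as a fix adopted in this thread: compare how a Gaussian kernel behaves as dim grows when its width is fixed versus when it is set by the median heuristic (width equal to the median pairwise distance). Everything here is an assumption for illustration, including the fixed width of 0.5, the z-scoring, and the helper names; it does not call causal-learn.

```python
import numpy as np


def pairwise_sq_dists(data):
    """Squared Euclidean distances between all rows of `data`."""
    sq_norms = np.sum(data ** 2, axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * data @ data.T
    return np.maximum(sq, 0.0)


def mean_offdiag_kernel(data, width=None):
    """Mean off-diagonal Gaussian kernel value; width=None uses the median heuristic."""
    sq = pairwise_sq_dists(data)
    off = sq[np.triu_indices_from(sq, k=1)]  # each unordered pair once
    if width is None:
        width = np.sqrt(np.median(off))      # median pairwise distance
    return float(np.mean(np.exp(-off / (2.0 * width ** 2))))


rng = np.random.default_rng(0)
fixed, median = {}, {}
for dim in (1, 5, 10):
    x = rng.random((200, dim))
    x = (x - x.mean(axis=0)) / x.std(axis=0)  # z-score each column
    fixed[dim] = mean_offdiag_kernel(x, width=0.5)
    median[dim] = mean_offdiag_kernel(x)
    print(f"dim={dim}: fixed width -> {fixed[dim]:.2e}, median heuristic -> {median[dim]:.3f}")
```

If the off-diagonal entries vanish under the width the test actually uses, the Gram matrix is effectively the identity and the test cannot distinguish X;Y from X;Z, which would match the near-identical p-values above.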

Jul 26 '22 14:07 aoqiz
