confound_prediction

Generalization to more than 1 confounding factor

Rachine opened this issue 4 years ago · 1 comment

Hello, thank you very much for tackling the issue of confounders, which recurs frequently in clinical ML problems.

I have some questions about the project/paper:

  1. I am wondering why only the test set needs to be deconfounded. Why not also build a deconfounded train set alongside the deconfounded test set (with no data leakage, of course)?
  2. I tried to generalize your methodology to k multiple confounders [image]. I still used most of your codebase, together with a pseudo-generalization of the mutual information to multiple variables. The probability of sample i being drawn, m_i, which was

[image]

is now:

[image]

The quantity [image] can still be estimated with kernel density estimation.
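Since the exact sampling expression is in an image that did not survive, here is a minimal sketch of what such a KDE-based subsampling step might look like, assuming (hypothetically) that the sampling weight is taken inversely proportional to the estimated joint density of the target and the k confounders:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Toy data: target y and k = 3 confounders z (hypothetical shapes).
n, k = 1000, 3
z = rng.normal(size=(k, n))
y = z.sum(axis=0) + rng.normal(size=n)

# Estimate the joint density p(y, z_1, ..., z_k) with a Gaussian KDE.
data = np.vstack([y, z])
kde = gaussian_kde(data)
density = kde(data)

# Assumed form: sampling weight inversely proportional to the joint
# density (the exact formula from the issue is not recoverable here).
weights = 1.0 / np.clip(density, 1e-12, None)
weights /= weights.sum()

# Draw a subsampled test set of size m without replacement.
m = 200
idx = rng.choice(n, size=m, replace=False, p=weights)
```

This only illustrates the mechanics; the actual weight formula from the paper/issue may differ.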

I ran some quick toy experiments; the approach seems to work approximately on simple additive toy examples when the number of samples is sufficient. For instance, with 1000 samples and 10 confounding factors I got:

[image]

With 100 samples and 3 confounding factors I got:

[image]
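For reference, a minimal sketch of the kind of "simple additive" toy setup described above, with the individual confound-target correlations that the toy experiments monitor (the exact generative model used in the issue is an assumption here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Additive toy: each confounder z_j contributes linearly to the target y
# (a guess at the "simple additive" setup described in the comment).
n, k = 1000, 3
z = rng.normal(size=(n, k))
y = z @ np.ones(k) + rng.normal(size=n)

# Individual confound-target correlations, the quantities one would
# compare before and after the deconfounding subsampling step.
corrs = [np.corrcoef(y, z[:, j])[0, 1] for j in range(k)]
```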

It would also be interesting to study the sample size N required to guarantee, at a given confidence level, the deconfounding capability for k factors, depending on the type of link.

Do you think this is a correct approach and generalization?

Thank you

Best regards

— Rachine, Jul 09 '20 14:07

Oops, after some thinking, maybe I should look at the goodness of fit with the multiple variables jointly, and not only at individual correlations, to test:

[image]

[image]

I added the R² from an ordinary least squares fit with statsmodels, 'y ~ z0 + z1 + z2':

[image]
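A minimal sketch of that joint goodness-of-fit check with statsmodels, using a hypothetical toy frame with a target y and three confounders z0..z2:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Hypothetical toy data: target y depends additively on z0, z1, z2.
n = 500
df = pd.DataFrame(rng.normal(size=(n, 3)), columns=["z0", "z1", "z2"])
df["y"] = df[["z0", "z1", "z2"]].sum(axis=1) + rng.normal(size=n)

# Joint goodness of fit: R^2 of an OLS regression of y on all
# confounders at once, via the formula 'y ~ z0 + z1 + z2'.
fit = smf.ols("y ~ z0 + z1 + z2", data=df).fit()
r2 = fit.rsquared
```

After the deconfounding subsampling, one would hope this joint R² drops toward zero on the retained subset, rather than only the individual correlations.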

— Rachine, Jul 10 '20 08:07