Best way to compute mutual information in high dimension when all but one variable are iid
Thanks for making this wonderful package.
I'm trying to compute mutual information in high dimensions, but the case I'm interested in is exceptionally simple, so there may be a faster method than the built-in function.
Specifically, I have a function $f(x_{1},\dots,x_{n})$ where $n$ is large, and I would like to estimate the mutual information between the random variable $F = f(X_{1},\dots,X_{n})$ and the independent and identically distributed (iid) random variables $X_{1},\dots,X_{n}$ (that is, $I(F;X_{1},\dots,X_{n})$), given a large number of samples. Since all but one of the variables are iid, I'm hoping the calculation simplifies dramatically.
The documentation has a comment which is rather suggestive but I confess I don't really understand what is being said:
"On the other hand, in high-dimensions, KDTree’s are not very good, and you might be reduced to the speed of brute force $N^2$ search for nearest neighbors. If $N$ is too big, I recommend using multiple, random, small subsets of points ($N′ << N$) to estimate entropies, then average over them."
If I have $N$ samples of the form $\{F, X_{1},\dots,X_{n}\}_{i}$, where $i$ runs from 1 to $N$, is this just saying that one should take a subset of $M < N$ of these samples, compute the mutual information on it, do this multiple times, and then average the results? Or is it saying to somehow construct the mutual information by averaging over lower-dimensional samples $\{F, X_{1},\dots,X_{m}\}_{i}$, where $m < n$? If the former, why is this advantageous over just using the built-in method on all $N$ samples? If the latter, how exactly does this work?
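If it is the former, here is roughly what I have in mind; this is just a sketch, assuming the `entropy_estimators.mi` interface from the README (the function name, subset size, and number of repeats are my own placeholders, not part of NPEET):

```python
import numpy as np
from npeet import entropy_estimators as ee  # assuming the package layout from the NPEET README

def subsampled_mi(f_samples, x_samples, subset_size=1000, n_repeats=10, k=3, seed=None):
    """Average the MI estimate over random subsets of the N samples
    (the 'former' interpretation above).  f_samples: length-N array of F
    values; x_samples: N x n array of the iid inputs.  All names and
    defaults here are my own placeholders, not part of NPEET."""
    rng = np.random.default_rng(seed)
    N = len(f_samples)
    estimates = []
    for _ in range(n_repeats):
        idx = rng.choice(N, size=min(subset_size, N), replace=False)
        f_sub = [[f_samples[i]] for i in idx]          # NPEET expects lists of sample vectors
        x_sub = [list(x_samples[i]) for i in idx]
        estimates.append(ee.mi(f_sub, x_sub, k=k))
    return float(np.mean(estimates))
```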
However, this question about the documentation may not be relevant. Perhaps there is a more direct way of answering my primary question.
Thanks again for the wonderful code!
Ah... actually it seems to be extremely (read: embarrassingly) simple. It seems I can just write $$ I(F;X_{1},\dots,X_{n}) = H(F) - H(F|X_{1},\dots,X_{n}) = H(F), $$ since $f(x_{1},\dots,x_{n})$ is a deterministic function.
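In that case I could presumably just estimate $H(F)$ from the samples of $F$ alone. A minimal sketch of what I mean, assuming the `entropy_estimators.entropy` interface from the NPEET README (the toy data below just stands in for my real samples):

```python
import numpy as np
from npeet import entropy_estimators as ee  # assuming the package layout from the NPEET README

# Toy stand-in for my real data: N samples of the iid inputs and of F = f(X).
rng = np.random.default_rng(0)
x_samples = rng.standard_normal((10000, 50))   # N x n matrix of iid X's
f_samples = x_samples.sum(axis=1)              # F = f(X_1, ..., X_n)

# ee.entropy expects a list of sample vectors, so wrap each scalar F value.
h_f = ee.entropy([[f] for f in f_samples], k=3)
print(h_f)
```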
Right! If these are continuous random variables, then there is another wrinkle: $H(F|X_{1},\dots,X_{n})$ actually goes to negative infinity, and the mutual information becomes infinite (for a noiseless function). This is discussed a little in Cover & Thomas, Chapter 8.
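To see why, consider the simplest one-dimensional example: if $F = X + \epsilon Z$ with $X \sim \mathcal{N}(0,\sigma^{2})$ and independent noise $Z \sim \mathcal{N}(0,1)$, then $$ I(F;X) = \frac{1}{2}\log\left(1 + \frac{\sigma^{2}}{\epsilon^{2}}\right) \to \infty \quad \text{as } \epsilon \to 0, $$ so for a truly noiseless function there is no finite mutual information to estimate.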
You could just use the MI estimator directly, and it would give you a number that is NOT infinity. This is a limitation of the estimator, so the number you get will not be very meaningful. We talk a little about this in this paper. If you have nearly deterministic functions, you should try NPEET_LNC.
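For instance, here is a quick sanity check of that point (a sketch only, assuming the `entropy_estimators.mi` interface from the NPEET README; the toy function and sample size are arbitrary):

```python
import numpy as np
from npeet import entropy_estimators as ee  # assuming the package layout from the NPEET README

rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 3))       # iid inputs X_1, X_2, X_3
f = np.sum(x**2, axis=1)                 # noiseless deterministic F = f(X)

# The estimator returns a finite number here even though the true mutual
# information of a noiseless continuous function is infinite, so the value
# should not be taken at face value.
print(ee.mi([[v] for v in f], [list(row) for row in x], k=3))
```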
Also, I should point out that in my documentation I was using $N$ to talk about the number of samples, not the dimensionality of the random variables.
Many thanks for this. I will certainly be making use of NPEET_LNC and I will take a closer look at your paper.