entropy_estimators
Mutual information is greater than information entropy
Thank you very much. Could you explain why the mutual information is greater than the information entropy? The code is as follows:
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

# np.random.seed(1)
x = np.random.standard_normal(1000)
y = x + 0.1 * np.random.standard_normal(x.shape)

# mutual information I(X;Y) via scikit-learn (Kraskov et al. estimator)
mi = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)

# marginal entropies H(X), H(Y) via entropy_estimators (Kozachenko-Leonenko estimator)
infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)

print(infor_ent_x, infor_ent_y, mi)  # the MI comes out larger than both entropies
Hi, thanks for raising the issue. Both packages implement the Kozachenko-Leonenko estimator for entropy and the Kraskov et al. estimator for mutual information. At first I thought the difference might come down to the parameterisation, but then I noticed that you already matched the k parameter. I then dug into their source code a bit. There is a minor implementation difference in that scikit-learn adds some small noise to the data by default; however, that is unlikely to explain the rather large discrepancy. The major discrepancy and (to me) smoking gun is in these lines:
mi = (
digamma(n_samples)
+ digamma(n_neighbors)
- np.mean(digamma(nx + 1))
- np.mean(digamma(ny + 1))
)
If you read the paper, you will notice that the correct equation (equation 8) is instead:
mi = digamma(k) - np.mean(digamma(nx+1) + digamma(ny+1)) + digamma(n)
I tested my implementation against distributions for which the mutual information can be computed analytically, so I am fairly sure that this equation is not only the intended but also the correct one.
Tl;dr: they may have put some parentheses in the wrong place.
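For concreteness, eq. (8) translates into code roughly as follows. This is only a minimal sketch using scipy's KD-tree (max norm in the joint space, no small-noise trick, tie handling ignored), not the actual implementation of either package:

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    # minimal sketch of the KSG estimator, eq. (8) in Kraskov et al. (2004)
    x = np.asarray(x).reshape(-1, 1)
    y = np.asarray(y).reshape(-1, 1)
    xy = np.hstack([x, y])
    n = len(xy)
    # distance to the k-th nearest neighbour in the joint space (max norm)
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]
    # number of marginal points strictly closer than eps
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    radii = np.nextafter(eps, 0)
    nx = np.array([len(tree_x.query_ball_point(pt, r, p=np.inf)) - 1
                   for pt, r in zip(x, radii)])
    ny = np.array([len(tree_y.query_ball_point(pt, r, p=np.inf)) - 1
                   for pt, r in zip(y, radii)])
    return digamma(k) - np.mean(digamma(nx + 1) + digamma(ny + 1)) + digamma(n)

Such a sketch can be compared directly against mutual_info_regression on the example above.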
Sorry, I don't see the difference between eq. (8) in the paper and mi (line 70 of _mutual_info.py) in scikit-learn. What is the problem with the brackets (parentheses)? Or maybe I didn't understand what you mean.
np.mean(digamma(nx+1) + digamma(ny+1)) != np.mean(digamma(nx+1)) + np.mean(digamma(ny+1))
The expression in the paper includes the left-hand term, the code in scikit-learn the term on the right.
By linearity of expectation, the expectation of a sum is equal to the sum of the expectations. So
np.mean(digamma(nx+1) + digamma(ny+1)) == np.mean(digamma(nx+1)) + np.mean(digamma(ny+1))
Only if nx and ny are uncorrelated. Which they are not.
Might be having a bit of brain fart though. I am having a cold, and every thought takes ages.
I think you are right. I had to run some numbers on the IPython prompt to help my reduced mental capacities understand basic math again. In that case, I don't know where the difference comes from, at least not today. I will look into it when my brain is in working condition again.
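For the record, a quick numerical check along these lines (with made-up neighbour counts, just for illustration) shows that the two expressions agree up to floating-point error even when nx and ny are strongly correlated:

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
nx = rng.integers(1, 20, size=1000)          # fake neighbour counts, illustration only
ny = nx + rng.integers(0, 5, size=1000)      # strongly correlated with nx
lhs = np.mean(digamma(nx + 1) + digamma(ny + 1))
rhs = np.mean(digamma(nx + 1)) + np.mean(digamma(ny + 1))
print(np.isclose(lhs, rhs))                  # True: linearity of expectation needs no independence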
Thank you for your careful reply. Good health comes first; I wish you a speedy recovery. Maybe you'll know the answer once you have recovered from your cold.
Actually, I don't think there is a difference at all. The definitional or so-called "naive" estimator of the mutual information is:
I(X;Y) = H(X) + H(Y) - H(X,Y)
If we plug in your example:
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous
np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)
print(mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False, n_neighbors=3))
# [2.31452164]
hx = continuous.get_h(x, k=3)
hy = continuous.get_h(y, k=3)
hxy = continuous.get_h(np.c_[x, y], k=3)
mi = hx + hy - hxy
print(mi)
# 2.325853446732216
I would say those estimates are pretty close, given that we are using two different methods to get the result.
I find mi > hx, which puzzles me.
That continues to be a strong point.
However, I am by now fully convinced that the entropy computations are fine:
import scipy.stats as st
from entropy_estimators import continuous
distribution = st.norm(0, 1)
analytical = distribution.entropy()
empirical = continuous.get_h(distribution.rvs(10000), k=3)
print(analytical, empirical)
# 1.4189385332046727 1.4197821857006883
This leaves the following options:
- The scikit-learn maintainers and I independently made the same mistake when implementing the KSG estimator for mutual information.
- The KSG estimator is wrong. This is somewhat true as there is a bias as discussed in the paper itself (Fig. 4 - Fig. 8). However, when I implemented it, I checked it extensively against analytical results, and I don't remember seeing large deviations for Gaussian distributions.
- We are both not thinking particularly clearly today.
My money is on option 3.
Actually, option 1 and 2 are ruled out as possible explanations by the fact that the naive estimator that I implemented above returns the same result for the mutual information as the KSG estimator...
Thanks a lot. I'll take the time to study it again.
Mutual information is not necessarily less than information entropy. I was misled by a picture on Wikipedia.
Why is mutual information not necessarily less than information entropy? I recently ran into the same problem: the MI is larger than the entropy.
mi is larger than entropy. This is indeed a serious problem.
The fact that the mutual information can be larger than the entropy is due to the fact that the continuous Shannon entropy formula,
$$ H(X) = - \int \mathrm{d}x\, p(x) \log p(x) $$
is not correct. Jaynes found that the correct formula should be¹
$$ H(X) = - \int \mathrm{d}x\, p(x) \log \frac{p(x)}{m(x)} $$
instead, where $m(x)$ is an invariant measure. This way, the entropy becomes invariant under scaling $X\to cX$ or shifting the data $X\to c + X$. Kraskov et al. were using the former, Shannon-style definition, which introduces a term that depends directly on the scaling. Their entropy estimator is given by
$$ H(X) = \psi(N) - \psi(k) + d \langle \log \varepsilon \rangle $$
where the last term, $\langle \log \varepsilon \rangle$, can become arbitrarily negative by transforming $X\to cX$ with $c\to 0$ (all nearest-neighbour distances shrink).
A first-order fix to make the entropy more meaningful is to apply a scaler, e.g. scikit-learn's RobustScaler, before performing the entropy estimation. Nevertheless, this does not ensure that $H(X,Y) \ge I(X,Y)$!
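To make that concrete, here is a minimal sketch (the factor 0.01 and the seed are arbitrary) showing that rescaling X shifts the Kozachenko-Leonenko entropy estimate by roughly d * log(c), and that rescaling the data first removes that offset:

import numpy as np
from sklearn.preprocessing import RobustScaler
from entropy_estimators import continuous

np.random.seed(1)
x = np.random.standard_normal(10000)
h = continuous.get_h(x, k=3)
h_scaled = continuous.get_h(0.01 * x, k=3)
print(h_scaled - h, np.log(0.01))            # the entropy shifts by ~ d * log(c), here d = 1
# scaling the data before the estimation removes the dependence on the original units
x_rescaled = RobustScaler().fit_transform((0.01 * x).reshape(-1, 1)).ravel()
print(continuous.get_h(x_rescaled, k=3))     # no longer carries the log(0.01) offset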
For example, in
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous
np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)
how can H(X,Y) be equal to the following?
hxy = continuous.get_h(np.c_[x, y], k=3)
The H(X,Y) here is estimated from a 10000*10000-dimensional vector, and you can't just get a joint distribution from the marginal distributions, so typically you have to assume it is a multivariate normal distribution, and that's what you do when the marginal distributions are normal. But if the marginal distributions are not normal, what can we assume? Or how can we create a normal joint distribution from two non-normal marginal distributions?
@singledoggy You seem to have a separate question. Please open another issue.
Sorry, I was thinking about this and suddenly jumped to another topic, and I didn't realize.
I think scaling is the right cause. We can see that h(x) and h(y) are affected by the scale while mi(x, y) is not.
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

# original scale
x = np.random.standard_normal(1000)
y = x + 0.1 * np.random.standard_normal(x.shape)
mi_reg = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)
mi_this = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))
infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)
print(infor_ent_x, infor_ent_y, mi_reg, mi_this)
# output: 1.4006482292373557 1.428777538316779 [2.37495851] 2.170190377210962
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

# same data, scaled down by a factor of 100
x = np.random.standard_normal(1000) / 100
y = x + 0.1 * np.random.standard_normal(x.shape) / 100
mi_reg = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)
mi_this = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))
infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)
print(infor_ent_x, infor_ent_y, mi_reg, mi_this)
# output: -3.1608263267673697 -3.1215943346483987 [2.43274451] 2.24871532950979
Another example is two independent normal distributions: as $\sigma$ changes, their entropies can go from negative to positive, but the mutual information stays 0, so the MI can be larger than, smaller than, or equal to the entropy.
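A minimal numerical sketch of that counterexample (the sigma values are arbitrary; same interfaces as above):

import numpy as np
from entropy_estimators import continuous

np.random.seed(1)
for sigma in (10.0, 1.0, 0.01):
    x = sigma * np.random.standard_normal(10000)
    y = sigma * np.random.standard_normal(10000)                # independent of x
    h_x = continuous.get_h(x, k=3)                              # moves with sigma, negative for small sigma
    mi = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))   # stays close to 0
    print(sigma, h_x, mi)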
That’s a nice intuitive counterexample! (I propose closing this issue with that.)
Hi everyone, to follow up on this discussion. As @singledoggy and I have explained, the reason that the mutual information of continuous variables can be larger than the entropies is that, in the limit $|\mathcal{X}|\to\infty$, the Shannon entropy $H(X) = -\sum_{x\in\mathcal{X}} p_x \ln p_x$ does not converge to $H(X) = -\int \mathrm{d}x\, p(x) \ln p(x)$. The probability density is a dimensional quantity, and therefore its logarithm, $\ln p(x)$, is not well defined: changing the units changes the values.
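For reference, the standard quantization argument makes this explicit: discretizing $X$ into bins of width $\Delta$ gives
$$ H_\Delta(X) = -\sum_i p(x_i)\,\Delta\, \ln\bigl(p(x_i)\,\Delta\bigr) \approx -\int \mathrm{d}x\, p(x) \ln p(x) - \ln \Delta, $$
so the discrete entropy diverges as $\Delta\to 0$, and the leftover $-\ln\Delta$ is exactly the unit-dependent piece. The mutual information involves only the ratio $p(x,y)/\bigl(p(x)p(y)\bigr)$, so this term cancels there, which is why the MI is well defined while the differential entropies are not bounded below.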
For anyone who actually needs to estimate a normalized mutual information, and for whom it is not enough to know why the Kraskov estimator fails: I've recently published an article, arXiv:2405.04980, where I discuss this issue in detail and present a generalization of the Kraskov estimator that is able to estimate the normalized mutual information as well. The source code is available on GitHub at moldyn/NorMI.
@braniii Thanks for taking the time to explain the issue clearly and providing an alternative. I have linked your last comment at the top of the README to direct people your way.