entropy_estimators
Mutual information is greater than information entropy
Thank you very much. Could you explain why the mutual information is greater than the information entropy? The code is as follows:
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

# np.random.seed(1)
x = np.random.standard_normal(1000)
y = x + 0.1 * np.random.standard_normal(x.shape)

# mutual information I(X;Y) via scikit-learn (Kraskov et al. estimator)
mi = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)

# marginal entropies H(X), H(Y) via entropy_estimators (Kozachenko-Leonenko estimator)
infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)

print(infor_ent_x, infor_ent_y, mi)  # the MI comes out larger than both entropies
Hi, thanks for raising the issue. Both packages implement the Kozachenko-Leonenko estimator for entropy and the Kraskov et al. estimator for mutual information. At first I thought the difference might come down to the parameterisation, but then I noticed that you already matched the k parameter. I then dug into their source code a bit. There is a minor implementation difference in that scikit-learn adds some small noise to the data by default; however, that is unlikely to explain the rather large discrepancy. The major discrepancy and (to me) smoking gun is in these lines:
mi = (
digamma(n_samples)
+ digamma(n_neighbors)
- np.mean(digamma(nx + 1))
- np.mean(digamma(ny + 1))
)
If you read the paper, you will notice that the correct equation (equation 8) is instead:
mi = digamma(k) - np.mean(digamma(nx+1) + digamma(ny+1)) + digamma(n)
I tested my implementation against distributions for which the mutual information can be computed analytically, so I am fairly sure that this equation is not only the intended but also the correct one.
Tl;dr: they may have put some parentheses in the wrong place.
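For concreteness, eq. (8) translates into code roughly as follows. This is only a minimal sketch using scipy's KD-tree (max norm in the joint space, no small-noise trick, tie handling ignored), not the actual implementation of either package:

import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=3):
    # minimal sketch of the KSG estimator, eq. (8) in Kraskov et al. (2004)
    x = np.asarray(x).reshape(-1, 1)
    y = np.asarray(y).reshape(-1, 1)
    xy = np.hstack([x, y])
    n = len(xy)
    # distance to the k-th nearest neighbour in the joint space (max norm)
    eps = cKDTree(xy).query(xy, k=k + 1, p=np.inf)[0][:, -1]
    # number of marginal points strictly closer than eps
    tree_x, tree_y = cKDTree(x), cKDTree(y)
    radii = np.nextafter(eps, 0)
    nx = np.array([len(tree_x.query_ball_point(pt, r, p=np.inf)) - 1
                   for pt, r in zip(x, radii)])
    ny = np.array([len(tree_y.query_ball_point(pt, r, p=np.inf)) - 1
                   for pt, r in zip(y, radii)])
    return digamma(k) - np.mean(digamma(nx + 1) + digamma(ny + 1)) + digamma(n)

Such a sketch can be compared directly against mutual_info_regression on the example above.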
Sorry, I don't see the difference between eq. (8) in the paper and mi (line 70 of _mutual_info.py) in scikit-learn. What is the problem with the brackets (parentheses)? Or maybe I didn't understand what you mean.
np.mean(digamma(nx+1) + digamma(ny+1)) != np.mean(digamma(nx+1)) + np.mean(digamma(ny+1))
The expression in the paper includes the left-hand term, the code in scikit-learn the term on the right.
By linearity of expectation, the expectation of a sum is equal to the sum of the expectations. So
np.mean(digamma(nx+1) + digamma(ny+1)) == np.mean(digamma(nx+1)) + np.mean(digamma(ny+1))
Only if nx and ny are uncorrelated. Which they are not.
Might be having a bit of brain fart though. I am having a cold, and every thought takes ages.
I think you are right. I had to run some numbers on the IPython prompt to help my reduced mental capacities understand basic math again. In that case, I don't know where the difference comes from, at least not today. I will look into it when my brain is in working condition again.
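For the record, a quick numerical check along these lines (with made-up neighbour counts, just for illustration) shows that the two expressions agree up to floating-point error even when nx and ny are strongly correlated:

import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
nx = rng.integers(1, 20, size=1000)          # fake neighbour counts, illustration only
ny = nx + rng.integers(0, 5, size=1000)      # strongly correlated with nx
lhs = np.mean(digamma(nx + 1) + digamma(ny + 1))
rhs = np.mean(digamma(nx + 1)) + np.mean(digamma(ny + 1))
print(np.isclose(lhs, rhs))                  # True: linearity of expectation needs no independence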
Thank you for your careful reply. Good health comes first; I wish you a speedy recovery. Maybe you'll know the answer once you have recovered from your cold.
Actually, I don't think there is a difference at all. The definitional or so-called "naive" estimator of the mutual information is:
I(X;Y) = H(X) + H(Y) - H(X,Y)
If we plug in your example:
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous
np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)
print(mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False, n_neighbors=3))
# [2.31452164]
hx = continuous.get_h(x, k=3)
hy = continuous.get_h(y, k=3)
hxy = continuous.get_h(np.c_[x, y], k=3)
mi = hx + hy - hxy
print(mi)
# 2.325853446732216
I would say those estimates are pretty close, given that we are using two different methods to get the result.
I find mi > hx, which puzzles me.
That continues to be a strong point.
However, I am by now fully convinced that the entropy computations are fine:
import scipy.stats as st
from entropy_estimators import continuous
distribution = st.norm(0, 1)
analytical = distribution.entropy()
empirical = continuous.get_h(distribution.rvs(10000), k=3)
print(analytical, empirical)
# 1.4189385332046727 1.4197821857006883
This leaves the following options:
- The scikit-learn maintainers and I independently made the same mistake when implementing the KSG estimator for mutual information.
- The KSG estimator is wrong. This is somewhat true as there is a bias as discussed in the paper itself (Fig. 4 - Fig. 8). However, when I implemented it, I checked it extensively against analytical results, and I don't remember seeing large deviations for Gaussian distributions.
- We are both not thinking particularly clearly today.
My money is on option 3.
Actually, option 1 and 2 are ruled out as possible explanations by the fact that the naive estimator that I implemented above returns the same result for the mutual information as the KSG estimator...
Thanks a lot. I'll take the time to study it again.
Mutual information is not necessarily less than information entropy. I was misled by a picture on Wikipedia.
Why is mutual information not necessarily less than information entropy? I recently ran into the same problem: the MI is larger than the entropy.
mi is larger than entropy. This is indeed a serious problem.
The fact that the mutual information can be larger than the entropy is due to the fact that the continuous Shannon entropy formula,
$$ H(X) = - \int \mathrm{d}x\, p(x) \log p(x) $$
is not correct. Jaynes found that the correct formula should be¹
$$ H(X) = - \int \mathrm{d}x\, p(x) \log \frac{p(x)}{m(x)} $$
instead, where $m(x)$ is an invariant measure. This way, the entropy becomes invariant under scaling $X\to cX$ or shifting the data $X\to c + X$. Kraskov et al. were using the former, Shannon-style definition, which introduces a term that depends directly on the scaling. Their entropy estimator is given by
$$ H(X) = \psi(N) - \psi(k) + d \langle \log \varepsilon \rangle $$
where the last term, $\langle \log \varepsilon \rangle$, can become arbitrarily negative by transforming $X\to cX$ with $c\to 0$ (all nearest-neighbour distances shrink).
A first-order fix to make the entropy more meaningful is to apply a scaler, e.g. scikit-learn's RobustScaler, before performing the entropy estimation. Nevertheless, this does not ensure that $H(X,Y) \ge I(X,Y)$!
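To make that concrete, here is a minimal sketch (the factor 0.01 and the seed are arbitrary) showing that rescaling X shifts the Kozachenko-Leonenko entropy estimate by roughly d * log(c), and that rescaling the data first removes that offset:

import numpy as np
from sklearn.preprocessing import RobustScaler
from entropy_estimators import continuous

np.random.seed(1)
x = np.random.standard_normal(10000)
h = continuous.get_h(x, k=3)
h_scaled = continuous.get_h(0.01 * x, k=3)
print(h_scaled - h, np.log(0.01))            # the entropy shifts by ~ d * log(c), here d = 1
# scaling the data before the estimation removes the dependence on the original units
x_rescaled = RobustScaler().fit_transform((0.01 * x).reshape(-1, 1)).ravel()
print(continuous.get_h(x_rescaled, k=3))     # no longer carries the log(0.01) offset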
For example, in
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous
np.random.seed(1)
x = np.random.standard_normal(10000)
y = x + 0.1 * np.random.standard_normal(x.shape)
how can H(X,Y) be equal to the following?
hxy = continuous.get_h(np.c_[x, y], k=3)
The H(X,Y) here is estimated from a 10000*10000-dimensional vector, and you can't just get a joint distribution from the marginal distributions, so typically you have to assume it is a multivariate normal distribution, and that's what you do when the marginal distributions are normal. But if the marginal distributions are not normal, what can we assume? Or how can we create a normal joint distribution from two non-normal marginal distributions?
@singledoggy You seem to have a separate question. Please open another issue.
Sorry, I was thinking about this and suddenly jumped to another topic, and I didn't realize.
I think scaling is the right cause. We can see that h(x) and h(y) are affected by the scale while mi(x, y) is not.
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

# original scale
x = np.random.standard_normal(1000)
y = x + 0.1 * np.random.standard_normal(x.shape)
mi_reg = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)
mi_this = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))
infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)
print(infor_ent_x, infor_ent_y, mi_reg, mi_this)
# output: 1.4006482292373557 1.428777538316779 [2.37495851] 2.170190377210962
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from entropy_estimators import continuous

# same data, scaled down by a factor of 100
x = np.random.standard_normal(1000) / 100
y = x + 0.1 * np.random.standard_normal(x.shape) / 100
mi_reg = mutual_info_regression(x.reshape(-1, 1), y.reshape(-1, ), discrete_features=False)
mi_this = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))
infor_ent_x = continuous.get_h(x, k=3)
infor_ent_y = continuous.get_h(y, k=3)
print(infor_ent_x, infor_ent_y, mi_reg, mi_this)
# output: -3.1608263267673697 -3.1215943346483987 [2.43274451] 2.24871532950979
Another example is two independent normal distributions: as $\sigma$ changes, their entropies can go from negative to positive, but the mutual information stays 0, so the MI can be larger than, smaller than, or equal to the entropy.
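A minimal numerical sketch of that counterexample (the sigma values are arbitrary; same interfaces as above):

import numpy as np
from entropy_estimators import continuous

np.random.seed(1)
for sigma in (10.0, 1.0, 0.01):
    x = sigma * np.random.standard_normal(10000)
    y = sigma * np.random.standard_normal(10000)                # independent of x
    h_x = continuous.get_h(x, k=3)                              # moves with sigma, negative for small sigma
    mi = continuous.get_mi(x.reshape(-1, 1), y.reshape(-1, ))   # stays close to 0
    print(sigma, h_x, mi)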
That’s a nice intuitive counterexample! (I propose closing this issue with that.)
Hi everyone, to follow up on this discussion. As @singledoggy and I have explained, the reason that the mutual information of continuous variables can be larger than the entropies is that, in the limit $|\mathcal{X}|\to\infty$, the Shannon entropy $H(X) = -\sum_{x\in\mathcal{X}} p_x \ln p_x$ does not converge to $H(X) = -\int \mathrm{d}x\, p(x) \ln p(x)$. The probability density is a dimensional quantity, and therefore its logarithm, $\ln p(x)$, is not well defined: changing the units changes the values.
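For reference, the standard quantization argument makes this explicit: discretizing $X$ into bins of width $\Delta$ gives
$$ H_\Delta(X) = -\sum_i p(x_i)\,\Delta\, \ln\bigl(p(x_i)\,\Delta\bigr) \approx -\int \mathrm{d}x\, p(x) \ln p(x) - \ln \Delta, $$
so the discrete entropy diverges as $\Delta\to 0$, and the leftover $-\ln\Delta$ is exactly the unit-dependent piece. The mutual information involves only the ratio $p(x,y)/\bigl(p(x)p(y)\bigr)$, so this term cancels there, which is why the MI is well defined while the differential entropies are not bounded below.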
For anyone who actually needs to estimate a normalized mutual information, and for whom it is not enough to know why the Kraskov estimator fails: I've recently published an article, arXiv:2405.04980, where I discuss this issue in detail and present a generalization of the Kraskov estimator that is able to estimate the normalized mutual information as well. The source code is available on GitHub at moldyn/NorMI.
@braniii Thanks for taking the time to explain the issue clearly and providing an alternative. I have linked your last comment at the top of the README to direct people your way.