StatsBase.jl
Entropy calculation with non-probability vectors
The current (v0.33.16) implementation of entropy(p)
https://github.com/JuliaStats/StatsBase.jl/blob/071d10a110d756fa15c6a6878936eee27e4c2cb6/src/scalarstats.jl#L760-L768
returns weird results if the argument isn't a probability-like vector (i.e. sum(p) != 1):
julia> entropy([0.5,0.5],2) # correct
1.0
julia> entropy([0.5,0.25],2) # this should throw an error?
1.0
julia> entropy([0.25,0.25],2) # this too, or at least some notion of normalization of p?
1.0
julia> entropy([0.125,0.125],2) # incorrect, no idea what's happening here
0.75
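For reference, entropy(p, b) here is essentially -sum(x * log(x) for x in p) / log(b) with no constraint on sum(p), which is all that is happening above. A minimal sketch of that formula (not the actual StatsBase source; plogp and naive_entropy are hypothetical names):
plogp(x) = iszero(x) ? zero(float(x)) : x * log(x)   # convention: 0 * log(0) == 0
naive_entropy(p, b) = -sum(plogp, p) / log(b)         # no check that sum(p) == 1

naive_entropy([0.5, 0.5], 2)      # 1.0
naive_entropy([0.25, 0.25], 2)    # 1.0  (sums to 0.5; same value only by coincidence)
naive_entropy([0.125, 0.125], 2)  # 0.75 == -2 * 0.125 * log2(0.125)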
What about adding a Distributions.isprobvec(p) check in entropy to avoid these cases?
Yes we should probably check that the sum is one, with a check keyword argument to disable the check for performance. Throwing an error when the sum isn't one would be technically breaking, but maybe it's OK to do this in a patch release given that we would have returned nonsensical results anyway. See also https://github.com/JuliaStats/StatsBase.jl/pull/725.
Cc: @andreasnoack @matbesancon @devmotion @mschauer
I came across this when calculating the entropy of single values (like entropy(8,2) = -24.0 bit, nonsense). Currently my uncertainty model includes certainty and robust bounds like Union{Real, Interval, UnivariateDistribution}. For me a multiple dispatch fix works: entropy(x::Real) = 0.0.
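A hedged sketch of such a dispatch-based workaround on the caller's side (my_entropy is a hypothetical wrapper name, not a proposal for StatsBase itself):
using StatsBase

# Hypothetical user-side wrapper: a single certain value carries no uncertainty,
# so its entropy is defined as zero; vectors are forwarded to StatsBase.entropy.
my_entropy(x::Real) = 0.0
my_entropy(p::AbstractVector{<:Real}) = entropy(p)

my_entropy(8)           # 0.0 instead of the nonsensical -8 * log(8)
my_entropy([0.5, 0.5])  # 0.6931471805599453 (nats)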
A better way is LinearAlgebra.normalize-ing the input. This would yield correct results for singleton inputs (H = 0) and for vectors / collections of relative counts. For everything else someone has to explain the meaning of entropy in this context. (Interpret the [0.25, 0.25] as relative counts in some weird basis, like "1//4 per hour".)
...
prop = normalize(p, 1)
return -sum(xlogx, prop)
For probability vectors this changes nothing. At least when existing uses are consistent with the documentation, this would be non-breaking.
Performance should not be much different from the Distributions.isprobvec(p) variant. The check keyword is a fine idea.
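A self-contained version of that idea might look like this (a sketch only; entropy_normalized is a hypothetical name, and xlogx is taken from LogExpFunctions.jl, where xlogx(0) == 0):
using LinearAlgebra: normalize
using LogExpFunctions: xlogx

# Hypothetical variant: rescale the input so it sums to one before computing the entropy.
function entropy_normalized(p::AbstractVector{<:Real}, b::Real = ℯ)
    prop = normalize(p, 1)            # 1-norm normalization; allocates a new vector
    return -sum(xlogx, prop) / log(b)
end

entropy_normalized([0.5, 0.5], 2)     # 1.0, unchanged for probability vectors
entropy_normalized([0.25, 0.25], 2)   # 1.0, interpreted as relative counts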
Ok it's not that easy after playing with the tests.
If we don't want in-place normalization, the additional allocations (and thus performance) are an issue.
I wouldn't normalize by default anyway as this could hide bugs (just like the current situation). Better require users to be explicit.
So what would people think about
isprobvec(p::AbstractVector{<:Real}) = all(x -> x ≥ zero(x), p) && isapprox(sum(p), one(eltype(p)))
function entropy(p::AbstractVector{<:Real}; check::Bool = true)
    check && (isprobvec(p) || throw(ArgumentError("Not a proper probability distribution")))
    return -sum(xlogx, p)
end
entropy(p, b::Real; check::Bool = true) = entropy(p; check) / log(b)
If there's a thumbs up, I'd create a PR. This would yield
julia> entropy([0.5,0.5],2) # as before
1.0
julia> entropy([1,1],2) # throws an error
ERROR: ArgumentError: Not a proper probability distribution
julia> entropy([1,1],2,check=false) # but the check can be disabled
-0.0
julia> entropy([]) # throws an error as before, but because of dispatch constraints
ERROR: MethodError: no method matching entropy(::Vector{Any})
As you see I've taken isprobvec from Distributions.jl, as Distributions depends on StatsBase and not vice versa; I don't know how you want to deal with this otherwise. Also, as isprobvec is only defined for AbstractVector{<:Real} (does that make sense for a matrix??), I've included a similar constraint here for entropy.
I don't know whether the -0 here
julia> entropy([0,1])
-0.0
is usually just ignored, but I guess it would be more proper if, instead of return -sum(xlogx, p), we did just
return copysign(sum(xlogx, p), 0)
As sum(xlogx, p) has to be non-positive for a probability vector anyway, this would only affect the -0 case, where 0 is returned instead.
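A quick sanity check of that idea (for a probability vector sum(xlogx, p) is non-positive, so copysign(..., 0) just flips the sign, and maps a zero result to +0.0):
julia> copysign(-0.6931471805599453, 0)  # same as negating, for strictly negative sums
0.6931471805599453
julia> copysign(-0.0, 0)  # the zero case now comes out as +0.0
0.0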
On the argument constraint ::AbstractVector{<:Real}: I currently wonder about the meaning of these results
julia> entropy([1,1+im]) # complex probabilities
0.43882457311747564 - 1.131971753677421im
julia> entropy(rand(2,2)) # matrix input
1.0956946888140768
Not even in quantum mechanics are complex probabilities a thing afaik, and I'd usually also not interpret a matrix as a probability distribution, but maybe people want to calculate the entropy of a joint probability mass function? So maybe relax the constraint to ::AbstractArray{<:Real} and turn isprobvec into isprobarray?
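If it were relaxed, a hypothetical isprobarray could be a direct generalization of the vector check (a sketch only):
# Hypothetical: non-negative entries whose total sum is one, i.e. the whole array
# is read as a (possibly joint) probability mass function.
isprobarray(p::AbstractArray{<:Real}) =
    all(x -> x ≥ zero(x), p) && isapprox(sum(p), one(eltype(p)))

isprobarray([0.25 0.25; 0.25 0.25])  # true: a joint pmf on a 2×2 grid
isprobarray(rand(2, 2))              # almost always false: entries rarely sum to one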
There is no problem imho returning -0.0. It's a float, so there are signed versions of zero, 0.0 and -0.0.
julia> iszero(0.0)
true
julia> iszero(-0.0)
true
julia> 0.0 == -0.0
true
julia> 0.0 === -0.0
false
For matrices it's common to interpret rows or cols as probabilities (e.g. as a transition matrix for Markov chains). There are definitions of row/right, col/left and doubly stochastic matrices, meaning rows or cols or both each sum up to one (= can be interpreted as probs). In this context the user is better off doing something like:
isrowstochastic(M::AbstractMatrix) = all([ isprobvec(r) for r in eachrow(M) ])
H_rows = [ entropy(r) for r in eachrow(M) ]
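For instance (a hedged sketch, assuming isrowstochastic as above and entropy with the isprobvec check proposed earlier):
P = [0.9 0.1; 0.5 0.5]                        # row-stochastic Markov transition matrix
isrowstochastic(P)                            # true
H_rows = [entropy(r, 2) for r in eachrow(P)]  # per-state entropies in bits ≈ [0.469, 1.0]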
Joint entropy, mutual information and so on should better be defined separately.
Complex probabilities may be a thing, see "complex probability and information theory". Entropy extensions are in Sec. 8 but I don't get it.
Constrain the argument to AbstractVector{<:Real} and wait for people smarter than me.
Joint entropy, mutual information and so on should better be defined separately.
I agree, the entropy calculation here could easily return something other than what people expect if we allowed matrices, because as you say you may interpret each row as a probability vector, or all entries together. To avoid this, and to force the user to be more explicit, I'd suggest constraining the types here to vectors, so that you'd either need to do [entropy(x) for x in eachrow(M)] as you say, or entropy(vec(M)) for a joint probability matrix.
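A minimal illustration of the two explicit spellings (hedged; the numbers assume the vector-only entropy with base 2):
# All four entries together sum to one, so M can be read as a joint pmf over a 2×2 grid.
M = [0.4 0.1; 0.1 0.4]
entropy(vec(M), 2)   # joint entropy in bits ≈ 1.722

# For a row-stochastic matrix P one would instead take per-row entropies,
# [entropy(x, 2) for x in eachrow(P)], as in the earlier comment.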