
Entropy calculation with non-probability vectors

Open milankl opened this issue 3 years ago • 11 comments

The current (v0.33.16) implementation of entropy(p) https://github.com/JuliaStats/StatsBase.jl/blob/071d10a110d756fa15c6a6878936eee27e4c2cb6/src/scalarstats.jl#L760-L768

returns weird results if the argument isn't a probability-like vector (i.e. sum(p) != 1)

julia> entropy([0.5,0.5],2)       # correct
1.0
julia> entropy([0.5,0.25],2)      # this should throw an error?
1.0
julia> entropy([0.25,0.25],2)     # this too, or at least some notion of normalization of p?
1.0
julia> entropy([0.125,0.125],2)   # incorrect, no idea what's happening here
0.75
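
For context, roughly what the linked implementation computes is -sum(xlogx, p) / log(b), with no normalization of p. A minimal sketch (naive_entropy is just an illustrative name; xlogx here is assumed to come from LogExpFunctions):

using LogExpFunctions: xlogx      # xlogx(x) = x*log(x), with xlogx(0) defined as 0

naive_entropy(p, b) = -sum(xlogx, p) / log(b)   # no check that sum(p) == 1

naive_entropy([0.125, 0.125], 2)  # 0.75, i.e. -2 * 0.125 * log2(0.125)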

milankl avatar Feb 21 '22 14:02 milankl

What about adding a Distributions.isprobvec(p) check in entropy to avoid these cases?

milankl avatar Feb 21 '22 14:02 milankl

Yes we should probably check that the sum is one, with a check keyword argument to disable the check for performance. Throwing an error when the sum isn't one would be technically breaking, but maybe it's OK to do this in a patch release given that we would have returned nonsensical results anyway. See also https://github.com/JuliaStats/StatsBase.jl/pull/725.

Cc: @andreasnoack @matbesancon @devmotion @mschauer

nalimilan avatar Apr 21 '22 20:04 nalimilan

I came across this when calculating the entropy of single values (like entropy(8,2) = -24.0 bits, which is nonsense). Currently my uncertainty model includes certainty and robust bounds like Union{Real, Interval, UnivariateDistribution}. For me a multiple-dispatch fix works: entropy(x::Real) = 0.0.
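
A minimal sketch of that workaround (extending StatsBase.entropy from user code is type piracy, so this is for illustration only):

import StatsBase: entropy    # extend the existing generic function

entropy(x::Real) = 0.0       # a single certain value carries no uncertainty

entropy(8)      # 0.0
entropy(8, 2)   # 0.0 as well, since entropy(p, b) divides entropy(p) by log(b)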

A better way is LinearAlgebra.normalize-ing the input. This would yield correct results for singleton inputs (H=0) and for vectors / collections of relative counts. For everything else, someone has to explain what entropy means in that context. (Interpret [0.25, 0.25] as relative counts in some weird basis, like "1//4 per hour".)

...
prop = normalize(p, 1)       # rescale to unit 1-norm, i.e. sum(prop) == 1
return -sum(xlogx, prop)

For probability vectors this changes nothing. At least where existing uses are consistent with the documentation, this would be non-breaking. Performance should not differ much from the Distributions.isprobvec(p) variant. The check keyword is a fine idea.
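
Spelled out as a self-contained sketch of that idea (normalized_entropy is an illustrative name, not a StatsBase method):

using LinearAlgebra: normalize
using LogExpFunctions: xlogx

function normalized_entropy(p::AbstractVector{<:Real}, b::Real)
    prob = normalize(p, 1)            # rescale so that sum(prob) == 1 (for non-negative input)
    return -sum(xlogx, prob) / log(b)
end

normalized_entropy([0.25, 0.25], 2)   # 1.0 -- input treated as relative counts
normalized_entropy([8.0], 2)          # -0.0 -- a singleton has zero entropy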

i9e1 avatar May 12 '23 16:05 i9e1

OK, after playing with the tests it turns out it's not that easy.
If we don't want in-place normalization, the additional allocations (and thus the performance hit) are an issue.

i9e1 avatar May 13 '23 15:05 i9e1

I wouldn't normalize by default anyway as this could hide bugs (just like the current situation). Better require users to be explicit.

nalimilan avatar May 28 '23 11:05 nalimilan

So what would people think about

# mirrors Distributions.isprobvec: non-negative entries summing (approximately) to one
isprobvec(p::AbstractVector{<:Real}) = all(x -> x ≥ zero(x), p) && isapprox(sum(p), one(eltype(p)))

function entropy(p::AbstractVector{<:Real}; check::Bool = true)
    check && (isprobvec(p) || throw(ArgumentError("Not a proper probability distribution")))
    return -sum(xlogx, p)
end

# base-b entropy, forwarding the check keyword
entropy(p, b::Real; check::Bool = true) = entropy(p; check) / log(b)

If there's a thumbs up, I'd create a PR. This would yield

julia> entropy([0.5,0.5],2)   # as before
1.0

julia> entropy([1,1],2)    #  throws an error
ERROR: ArgumentError: Not a proper probability distribution

julia> entropy([1,1],2,check=false)     # but the check can be disabled
-0.0

julia> entropy([])    # throws an error as before, but because of dispatch constraints
ERROR: MethodError: no method matching entropy(::Vector{Any})

As you see, I've taken isprobvec from Distributions.jl, as Distributions depends on StatsBase and not vice versa; I don't know how you want to deal with this otherwise. Also, as isprobvec is only defined for AbstractVector{<:Real} (does that make sense for a matrix??), I've included a similar constraint here for entropy.

milankl avatar May 30 '23 18:05 milankl

I don't know whether the -0 here

julia> entropy([0,1])
-0.0

is usually just ignored, but I guess it would be more proper if, instead of return -sum(xlogx, p), we did just

return copysign(sum(xlogx, p), 0)

Since sum(xlogx, p) has to be non-positive for a probability vector anyway, this would only affect the -0 case, where 0 is returned instead?
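
As a sketch of that variant (entropy_nonneg is just an illustrative name):

using LogExpFunctions: xlogx

# copysign(x, 0) keeps the magnitude of x and forces a non-negative sign,
# so the degenerate -0.0 case becomes 0.0 while other results are unchanged
entropy_nonneg(p) = copysign(sum(xlogx, p), 0)

entropy_nonneg([0, 1])        # 0.0 instead of -0.0
entropy_nonneg([0.5, 0.5])    # 0.6931... nats, same as -sum(xlogx, p)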

milankl avatar May 30 '23 18:05 milankl

On the argument constraint ::AbstractVector{<:Real}: I currently wonder about the meaning of these results

julia> entropy([1,1+im])    # complex probabilities
0.43882457311747564 - 1.131971753677421im

julia> entropy(rand(2,2))   # matrix input
1.0956946888140768

Not even in quantum mechanics are complex probabilities a thing afaik, and I'd usually also not interpret a matrix as a probability distribution, but maybe people want to calculate the entropy of a joint probability mass function? So maybe relax it to ::AbstractArray{<:Real} and turn isprobvec into isprobarray?
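
A possible shape of that relaxation (isprobarray is hypothetical, mirroring the isprobvec definition above):

isprobarray(p::AbstractArray{<:Real}) =
    all(x -> x ≥ zero(x), p) && isapprox(sum(p), one(eltype(p)))

isprobarray([0.25 0.25; 0.25 0.25])   # true -- a joint pmf laid out as a matrix
isprobarray(rand(2, 2))               # almost surely false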

milankl avatar May 30 '23 18:05 milankl

There is no problem imho with returning -0.0. It's a float, so there are signed versions of zero, 0.0 and -0.0.

julia> iszero(0.0)
true

julia> iszero(-0.0)
true

julia> 0.0 == -0.0
true

julia> 0.0 === -0.0
false

For matrices it's common to interpret rows or columns as probabilities (e.g. as a transition matrix of a Markov chain). There are definitions of row/right, column/left and doubly stochastic matrices, meaning the rows, the columns, or both sum up to one (i.e. can be interpreted as probability vectors). In this context the user would be better off doing something like:

isrowstochastic(M::AbstractMatrix) = all([ isprobvec(r) for r in eachrow(M) ])

H_rows = [ entropy(r) for r in eachrow(M) ]
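
For example (a sketch; assumes entropy from StatsBase and isprobvec from Distributions are in scope for the helpers above):

M = [0.9 0.1;
     0.5 0.5]                                # row-stochastic transition matrix

isrowstochastic(M)                           # true
H_rows = [entropy(r) for r in eachrow(M)]    # ≈ [0.325, 0.693] nats, one per state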

Joint entropy, mutual information and so on are better defined separately.

i9e1 avatar May 31 '23 21:05 i9e1

Complex probabilities may be a thing ("complex probability and information theory"). Entropy extensions are in Sec. 8, but I don't get it. Constrain the argument to AbstractVector{<:Real} and wait for people smarter than me.

i9e1 avatar May 31 '23 22:05 i9e1

Joint entropy, mutual information and so on are better defined separately.

I agree, the entropy calculation here could easily return something other than what people expect if we allowed matrices, because, as you say, you may interpret each row as a probability vector, or all entries together. To avoid this, and to force the user to be more explicit, I'd suggest constraining the types here to vectors, so that you'd either need to do [entropy(x) for x in eachrow(M)] as you say, or entropy(vec(M)) for a joint probability matrix.
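
For instance, with a joint pmf stored as a matrix, the flattening would be explicit on the caller's side (a sketch using the current StatsBase.entropy):

using StatsBase: entropy

P = [0.25 0.25;
     0.25 0.25]          # joint pmf of two independent fair coin flips

entropy(vec(P), 2)       # 2.0 bits -- joint entropy, with the flattening made explicit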

milankl avatar May 31 '23 22:05 milankl