Numerical instability in student t distribution
Hello,
thanks for creating the Distributions.jl package. I used the Student t distribution in a likelihood function and I noticed some numerical instability for large values of the degrees of freedom. I am not sure if this is expected behaviour.
Best, Felix
using Distributions
using Statistics
using CairoMakie
############# quantile function is numerically unstable for large degrees of freedom
ν = 10.0 .^ LinRange(0, 20, 200)
d = TDist.(ν)
y = quantile.(d, 0.75)
lines(ν, y; axis = (; ylabel = "0.75 quantile",
xlabel = "degrees of freedom ν",
xscale = log10))
############# likelihood/logpdf is numerically unstable for large degrees of freedom
data = rand.(TDist(100.0), 100)
νs = 10.0 .^ LinRange(0, 20, 200)
lls = Array{Float64}(undef, length(νs))
for (i,ν) in enumerate(νs)
d = TDist(ν)
lls[i] = sum(logpdf.(d, data))
end
lines(νs, lls; axis = (; ylabel = "log likelihood",
xlabel = "degrees of freedom ν",
xscale = log10))
I've transfered this issue to StatsFuns since the issue is in https://github.com/JuliaStats/StatsFuns.jl/blob/5e866baa763ff0f5d279185c1923e1cd87ea5699/src/distrs/fdist.jl#L21-L29 where we make use of the relationship between the F and Beta distributions. However, this relationship becomes numerically unstable when the second degrees of freedom parameter grows very large. I think the solution should be to switch to the χ² for some threshold of the second degrees of freedom parameter. This is of course identical to switching to the normal distribution when considering the T-distribution but doing it for the F-distribution will be more general. @FelixNoessler would you be able to prepare a PR? I believe it should be something like
ff = Symbol("fdist"*f)
bf = Symbol("beta"*f)
cf = Symbol("chisq"*f)
@eval function $ff(ν1::T, ν2::T, y::T) where {T<:Real}
if ν2 > some_large_value
return $cf(ν1, y)/ν1
else
x = $bf(ν1/2, ν2/2, y)
return x/(1 - x)*ν2/ν1
end
end
Thanks for your response! Your answer indeed fixes the issue for the quantile function.
However, there a remaining numerical instabilities for other functions of the student t distribution. For example here in line 7, we should switch with high value of the degrees of freedom (lower than infinity) to the normal distribution, this fixes the issue with the log pdf that I have also showed above: https://github.com/JuliaStats/StatsFuns.jl/blob/5e866baa763ff0f5d279185c1923e1cd87ea5699/src/distrs/tdist.jl#L6-L10
Should I address all this in one PR and use the same threshold value? Should I write a comment in the code why we use a threshold value to switch to the normal/chi^2 distribution?
I will test it a bit to get a good threshold value. I am happy to prepare a PR. I haven't done it often, but I will try and will be happy to get feedback.
Great. You can do it in one or two PRs. Either is fine. Adding a comment is almost always a good idea.