
Shapiro-Wilk normality test

kalmarek opened this issue 7 years ago • 46 comments

implements ShapiroWilkTest following

Patrick Royston, Approximating the Shapiro-Wilk W-test for non-normality, Statistics and Computing (1992) 2, 117-119. DOI: 10.1007/BF01891203

This is work in progress and all comments are welcome! I tried to follow the paper closely, but copied e.g. the precomputed constants from the original swilk.f. Please let me know if you are OK (license-wise) with this.

These polynomial interpolations are computed at load time for speed. Currently:

julia> using BenchmarkTools
julia> srand(1);
julia> k = 5000;
julia> X = sort(randn(k));
julia> @btime ShapiroWilkTest(X);
  112.523 μs (18 allocations: 137.45 KiB)
julia> swc = SWCoeffs(k);
julia> @btime ShapiroWilkTest(X, swc);
  62.476 μs (11 allocations: 78.50 KiB)

whereas calling swilkfort directly

julia> srand(1);
julia> X = sort(randn(5000));
julia> A = zeros(2500);
julia> @btime swilkfort!(X, A);
  205.763 μs (10 allocations: 224 bytes)
julia> @btime swilkfort!(X, A, false);
  75.692 μs (10 allocations: 224 bytes)

Still missing:

  • [x] documentation
  • [ ] censored data
  • [ ] polynomial interpolation with better precision (?),

and probably more.

I tried to compute exact values of SWCoeffs (via Monte Carlo simulation), but the results I'm getting are off from the ones reported in Table 1 op. cit. I would be glad if anyone could help.
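For reference, the coefficients being approximated here are the classical Shapiro-Wilk weights

$$a^{\top} = \frac{m^{\top} V^{-1}}{\sqrt{m^{\top} V^{-1} V^{-1} m}},$$

where $m$ is the vector of expected values of standard normal order statistics and $V$ is their covariance matrix.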

function approximate_A(N, samps)
    s = [sort(randn(N)) for i in 1:samps]
    m = sum(s[i][1:N] for i in 1:samps)/samps
    ss = vcat(s...)
    Vinv = inv(cov(ss, ss))
    A = (-m'*Vinv/sqrt(m'*Vinv*Vinv*m))'
    return ((A - reverse(A))/2)[1:div(N,2)]
end;

approximate_A(10,1000000) results in

[0.5469493640066541, 0.3559315327467397, 0.23322491989287789, 0.1336531031109471, 0.043609401791162974]

Compare with swilk.f's values and SWCoeffs(10).A:

[0.5737146934337275, 0.3289700603561813, 0.21434902886647542, 0.12279063037794058, 0.04008871216291867]
[0.5737147066903874, 0.3289700464878161, 0.21434901803439887, 0.1227906248657577,  0.04008871105102476]

Rahman & Govindarajulu

A different implementation of SWCoeffs following

M. Mahibbur Rahman & Z. Govindarajulu,
A modification of the test of Shapiro and Wilk for normality
Journal of Applied Statistics 24:2, 219-236, 1997
DOI: 10.1080/02664769723828

function HypothesisTests.SWCoeffs(N::Int, ::Type{Val{:Rahman}})
    M = [norminvcdf(i/(N+1)) for i in 1:N];
    normM = normpdf.(M)

    c = (N+1)*(N+2)
    d = 2c.*normM.^2
    dl= [-c*normM[i]*normM[i+1] for i in 1:N-1]
    Vinv = SymTridiagonal(d, dl)
    C = sqrt(M'*(Vinv*Vinv)*M)

    Astar = -(M'*Vinv)'/C
    return SWCoeffs(N, Astar[1:div(N,2)])
end

But they don't provide critical values for their statistic (or I couldn't find them), so that would require further simulation. Also, their SWCoeffs(10, Val{:Rahman}) are quite different:

[0.6010309510135708, 0.2975089851946684, 0.19212508233094944, 0.10979215067183574, 0.03583065785524651]

On the other hand, Algorithm AS R94 is a standard.

kalmarek avatar Jan 04 '18 18:01 kalmarek

Thanks for contributing this. Your implementation looks good. The main issue is that swilk.f is written for Float32 and not Float64, so you'll probably need to compute new polynomial coefficients for the Float64 implementation.

Another issue is the license. http://lib.stat.cmu.edu/apstat/ states that

The Royal Statistical Society holds the copyright to these routines, but has given its permission for their distribution provided that no fee is charged.

which is not compatible with the MIT license, so you should either try to implement these directly from the paper or try to contact the Royal Statistical Society and ask them if they have changed their mind regarding the clause. The best would be if they could use a real license.

andreasnoack avatar Jan 05 '18 08:01 andreasnoack

The main issue is that swilk.f is written for Float32 and not Float64, so you'll probably need to compute new polynomial coefficients for the Float64 implementation.

Yep, I tried to do this, but failed to obtain the "true" SWCoeffs by semi-exact means.

Even obtaining the "exact" expected values of the order statistics E(r:n) (our vector m) seems to be a problem. Convention: we compute the first half of the negated order statistics of N(0,1) (by symmetry E(r:n) = -E(n-r+1:n), so the first half determines the rest).

A simple MC produces highly variable results:

function mMC(N::Int,samps=10^5)
    X = Array{Float64}(N)
    s = zeros(div(N,2))
    for i in 1:samps
        for j in 1:N
            X[j] = randn()
        end
        sort!(X)
        for j in 1:div(N,2)
            s[j] += X[j]
        end
    end
    return -s/samps
end
d = mMC(101, 10^7) - mMC(101, 10^7);
norm(d,1)
0.0019123699000122493

More or less exact order statistics can be obtained by direct integration, e.g. following

Algorithm AS 177, Royston, J.P. Expected Normal Order Statistics (Exact and Approximate)

But I don't know how accurate normcdf etc. and quadgk are, i.e. how accurate is the following method?

function Royston(n, r, logs=[log(i) for i in 1:n], o=7*sqrt(n))
    # C = log(n!/((r-1)!(n-r)!)) -- the order-statistic constant, computed via logs
    C = sum(logs) - sum(view(logs, 2:r-1)) - sum(view(logs, 2:n-r))
    # I(x): log-density of the r-th order statistic (in the negated convention above)
    I(x) = C + (r-1)*log(normccdf(x)) + (n-r)*log(normcdf(x)) - 1/2*(log(2π) + x^2)
    return quadgk(x->x*exp(I(x)), -9.0-log(n),9.0+log(n), 
            order=floor(Int, o), 
            maxevals=floor(Int, log(n)*10^7))
end
function mRoyston(n, o=7*sqrt(n))
    logs=[log(i) for i in 1:n]
    m = [Royston(n,r, logs, o) for r in 1:div(n,2)]
    err = norm([i[2] for i in m],1)
    @show err
    return [i[1] for i in m]
end

Royston's Algorithm AS R94 uses Blom's approximation:

Blom(n,r) = -norminvcdf((r - 3/8)/(n + 1/4))
mBlom(n) = [Blom(n,r) for r in 1:div(n,2)]

and Rahman (cited above) used the simpler

Rahman(n,r) = -norminvcdf(r/(n + 1))
mRahman(n) = [Rahman(n,r) for r in 1:div(n,2)]

Comparing

I believe the direct integration to be the most accurate, so I'd lean towards regarding it as "true". Comparing other methods against it gives

const N = 5000
m_Ro = mRoyston(N);
m_mc = mMC(N, 10^6);
m_Bl = mBlom(N);
m_Rh = mRahman(N);

p = plot(yaxis=:log10, title="Relative to Royston",size=(800,600))
for (X, l) in [(m_mc, "MonteCarlo"), 
        (m_Bl, "Blom approximation"), 
        (m_Rh, "Rahman approximation"), 
   ]
    plot!(abs.((X-m_Ro)./m_Ro), label=l)
end
p

[plot: relative error of the MonteCarlo, Blom and Rahman approximations vs. direct integration ("Relative to Royston")]

Questions:

  1. Is storing/caching values of m (and later of SWCoeffs) an option (for sizes `N in 4:1:5000`)?
  2. Should we follow AS R94 for the ShapiroWilk test, or should we fit our own curves for distributions of W?

kalmarek avatar Jan 07 '18 01:01 kalmarek

But I don't know how accurate normcdf etc. and quadgk are, i.e. how accurate is the following method?

Both should be high quality and quadgk can provide an estimate of the error. Notice also that there is normlogcdf, which might give you a little extra accuracy over the explicit log.

So all-in-all I think this should be fine for computing the mean order statistics. Have you tried to compute the covariances yet? They are needed for fitting the polynomials, right?

  1. Is storing/caching values of m (and later of SWCoeffs) an option (for sizes `N in 4:1:5000`)?

I guess we could do that. We could also try to do this as part of a generated function that would compute everything on demand but it might make the first call pretty slow. It would be pretty cool on the other hand.

Should we follow AS R94 for the ShapiroWilk test, or should we fit our own curves for distributions of W?

Given how inaccurately they computed the order statistics back then, I guess it would be good if we could recompute them.

Another thing: do you have experience with the Weisberg-Bingham version? It is so much easier to compute, so do you know why people seem to prefer SW? Path dependency? I tried to look for power comparisons but couldn't find anything that suggested it to be significantly worse.

andreasnoack avatar Jan 08 '18 10:01 andreasnoack

For covariances I plan to do the following: [DavisStephens1977] gives an algorithm to compute the covariance matrix of normal order statistics given

  1. $V_{11}$ - the variance of the largest order statistic → approximated by [SheaScallon1988]
  2. $EX_1$, $EX_2$ - the expected values of the largest and second largest order statistics → ~exact by [Royston1982]
  3. $\Sigma^2$ - the sum of squares of the expected order statistics (normal scores) → approximated (with bounds on the error) by [Balakrishnan1984]
  • [DavisStephens1977]: C. S. Davis & M. A. Stephens, The covariance matrix of normal order statistics, Communications in Statistics - Simulation and Computation, Vol. 6, Iss. 1, 1977. DOI:10.1080/03610917708812028
  • [SheaScallon1988]: B. L. Shea and A. J. Scallon, Remark AS R72: A Remark on Algorithm AS 128. Approximating the Covariance Matrix of Normal Order Statistics, Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 37, No. 1 (1988), pp. 151-155. DOI:10.2307/2347514
  • [Royston1982]: J. P. Royston, Algorithm AS 177: Expected Normal Order Statistics (Exact and Approximate), Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 31, No. 2 (1982), pp. 161-165. DOI:10.2307/2347982
  • [Balakrishnan1984]: N. Balakrishnan, Algorithm AS 200: Approximating the Sum of Squares of Normal Scores, Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 33, No. 2 (1984), pp. 242-245. DOI:10.2307/2347458

But I may not have enough time to do this during this week.

We could also try to do this as part of a generated function that would compute everything on demand but it might make the first call pretty slow. It would be pretty cool on the other hand.

We may have a look at speeds and decide when I finish the plan above; but I guess a cached (or @memoized?) version of SWCoeffs(N::Int) will be the way to go.
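Something as simple as a Dict-backed lookup would already do; a rough sketch (SWCOEFFS_CACHE and compute_swcoeffs are just placeholder names, not package code):

# global cache keyed by the sample size N
const SWCOEFFS_CACHE = Dict{Int,Vector{Float64}}()

function cached_swcoeffs(N::Int)
    # compute_swcoeffs stands in for the (expensive) exact computation
    return get!(() -> compute_swcoeffs(N), SWCOEFFS_CACHE, N)
end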

Another thing: do you have experience with the Weisberg-Bingham version? It is so much easier to compute, so do you know why people seem to prefer SW? Path dependency? I tried to look for power comparisons but couldn't find anything that suggested it to be significantly worse.

Royston's AS R94 is essentially Weisberg-Bingham + corrections + fitting for the distribution of the W-statistic. AS R94 uses Blom's expectation of normal order statistics m and then corrects just the first two coefficients. This is an approximation of (the perfect) Shapiro-Wilk, if you like. If we can (on modern hardware) do it the "exact" way (and make it fast), maybe we should?
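For reference, the Weisberg-Bingham coefficients are just the normalized Blom scores; a minimal sketch (assuming StatsFuns for norminvcdf, as used above):

using StatsFuns: norminvcdf

# Blom's approximation to the expected normal order statistics
blom(n, i) = norminvcdf((i - 3/8) / (n + 1/4))

# Weisberg-Bingham coefficients: Blom scores normalized to unit length
function wb_coeffs(n)
    c = [blom(n, i) for i in 1:n]
    return c ./ sqrt(sum(abs2, c))
end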

kalmarek avatar Jan 08 '18 13:01 kalmarek

Another thing: do you have experience with the Weisberg-Bingham version?

As a matter of fact, I'm not a statistician; I'm doing pure math and coding in Julia "after hours" ;-). But for most of my "empirical science" friends SW was the one to go to, and it seems to have the best power for small samples.

kalmarek avatar Jan 08 '18 13:01 kalmarek

Ok, variances/covariances turned out to be harder than I thought: I implemented the exact formulas from Godwin's Some Low Moments of Order Statistics in normordstats.jl, please have a look. These become numerically unstable for n ~ 25, but for n=20 they agree to ~10 digits with Table 1 of Parrish. He used 128 bits of precision to compute tables of products of order statistics up to n=50 (and I guess he didn't progress any further because of loss of precision). So maybe my cheap implementation isn't that bad ;-)

(BTW, do you have, by any chance, access to a paper copy of Parrish's article (to scan and send me)? The digital one I have has no text layer and is of such poor quality that OCRing those tables produces errors in every row. Which, for an article to BORROW for 24h for $50, is OUTRAGEOUS.)

With these I can tell that Royston's approximation of SWCoeffs is surprisingly good, whereas Rahman's is (surprisingly) bad? Maybe there is an error in my implementation of the latter:

blom(n,i) = norminvcdf((i - 3/8)/(n + 1/4))
rahman(n,i) = norminvcdf(i/(n + 1))

struct Blom
    n::Int
end

struct Rahman
    n::Int
end

OrderStatistics.expectation(B::Blom) = [blom(B.n,i) for i in 1:B.n]
OrderStatistics.expectation(R::Rahman) = [rahman(R.n,i) for i in 1:R.n]

# M. Mahibbur Rahman & Z. Govindarajulu,  
# A modification of the test of Shapiro and Wilk for normality  
# *Journal of Applied Statistics* 24:2, 219-236, 1997  
# DOI: [10.1080/02664769723828](http://dx.doi.org/10.1080/02664769723828)

function invV(N, M)
    pdfM = normpdf.(M)
    d = pdfM.^2
    dl= [-pdfM[i]*pdfM[i+1] for i in 1:N-1]
    
    return (N+1)*(N+2)*SymTridiagonal(d, dl)
end

function HypothesisTests.SWCoeffs(R::Rahman)    
    M = expectation(R)
    A = invV(R.n, M)
    C = sqrt(M'*(A*A)*M)

    Astar = (M'*A)'/C
    return SWCoeffs(R.n, Astar[1:div(R.n,2)])
end

function HypothesisTests.SWCoeffs(OS::NormOrderStatistic)
    m = expectation(OS)
    iV = inv(cov(OS))
    A = m'*iV./sqrt(m'*iV*iV*m)
    return SWCoeffs(OS.n, A[1,1:div(OS.n,2)])
end

Then:

N = 25
@time ExactA = SWCoeffs(NormOrderStatistic(N));
plot(SWCoeffs(N).A, label="Blom + 0-correlation OrdStats")
plot!(SWCoeffs(Rahman(N)).A, label="Rahman + limit inverse correlation")
plot!(-ExactA.A, label="Exact formulas, BigFloat")

[plot: Blom, Rahman and exact (BigFloat) coefficients for N = 25; you can see the instability kicking in]

Right now I cannot think of anything better than the above-mentioned approximation by Davis & Stephens (using a Taylor expansion of the expectation of products of order statistics).

Using BigFloats, ArbFloats, etc. will not help much, as the precision of e.g. erf, and hence normcdf, is limited to Float64, at least for now.

kalmarek avatar Jan 25 '18 18:01 kalmarek

It took more time than I had predicted, but it turned out to be a fun project ;-) I'm looking for a piece of advice on what to do with the code, based on the results below. At the moment I think that numerically the Shapiro-Wilk test should be left "as is", until someone can point out what Royston fit his polynomials against. We should of course use the exact values of Parrish for N≤20 (or even N≤50 if I am able to fix the loss of precision mentioned below).

Normal order statistics

For the implementation see https://github.com/kalmarek/HypothesisTests.jl/tree/normordstats. Using this code you can confirm the covariances computed by Parrish (the 25dp values) for N≤20, but beyond that point there seems to be precision loss, which I was not able to trace.

David-Johnson formula:

function diffG(x)
    s = normpdf(x)
    s² = s^2
    s⁴ = s²*s²
    x² = x^2
    x⁴ = x²*x²
    
    return (
        1/s, 
        x/s², 
        (1 + 2x²)/(s*s²), 
        x*(7 + 6x²)/s⁴, 
        (7 + 46x² + 24*x⁴)/(s*s⁴),
        x*(127 + 326x² + 120x⁴)/(s²*s⁴)
        )
end
function covDavidJohnson(i::Int,j::Int,n::Int, order=3)
    if j < i
        return covDavidJohnson(j, i, n, order)
    end
        
    pᵢ, pⱼ = i/(n+1), j/(n+1)
    qᵢ, qⱼ = 1 - pᵢ, 1 - pⱼ
    Gᵢ = diffG(norminvcdf(pᵢ))
    Gⱼ = diffG(norminvcdf(pⱼ))
    X = 1/(n+2)
    T = Vector{eltype(Gᵢ)}()
    if order >= 1
        T1 = X  *pᵢ*qⱼ*Gᵢ[1]*Gⱼ[1]
        push!(T, T1)
    end
    if order >= 2
        T2 = X^2*pᵢ*qⱼ*(
                (qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] + 
                (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] + 
                0.5(
                    pᵢ*qᵢ*Gᵢ[3]*Gⱼ[1] +
                    pⱼ*qⱼ*Gᵢ[1]*Gⱼ[3] +
                    pᵢ*pⱼ*Gᵢ[2]*Gⱼ[2] 
                    )
                )
        push!(T, T2)
    end
    if order >= 3
        T3 = X^3*pᵢ*qⱼ*(
                -(qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] - (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] + 
                ((qᵢ-pᵢ)^2 - pᵢ*qᵢ)*Gᵢ[3]*Gⱼ[1] + 
                ((qⱼ-pⱼ)^2 - pⱼ*qⱼ)*Gᵢ[1]*Gⱼ[3] +
                1/2*(3(qᵢ-pᵢ)*(qⱼ-pⱼ) + pⱼ*qᵢ - 4pᵢ*qⱼ)*Gᵢ[2]*Gⱼ[2] + 
                5/6*pᵢ*qᵢ*(qᵢ-pᵢ)*Gᵢ[4]*Gⱼ[1] + 
                5/6*pⱼ*qⱼ*(qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[4] + 
                (pᵢ*qⱼ*(qᵢ-pᵢ) + 0.5pᵢ*qᵢ*(qⱼ-pⱼ))*Gᵢ[3]*Gⱼ[2] + 
                (pᵢ*qⱼ*(qⱼ-pⱼ) + 0.5pⱼ*qⱼ*(qᵢ-pᵢ))*Gᵢ[2]*Gⱼ[3] + 
                1/8*(pᵢ^2*qᵢ^2*Gᵢ[5]*Gⱼ[1] + pⱼ^2*qⱼ^2*Gᵢ[1]*Gⱼ[5]) + 
                1/4*(pᵢ^2*qᵢ*qⱼ*Gᵢ[4]*Gⱼ[2] + pᵢ*pⱼ*qⱼ^2*Gᵢ[2]*Gⱼ[4]) + 
                1/12*(2pᵢ^2*qⱼ^2 + 3pᵢ*pⱼ*qᵢ*qⱼ)*Gᵢ[3]*Gⱼ[3]
                )
        push!(T, T3)
    end
    return sum(T)
end
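To leading order (the T1 term above) this is the classical David-Johnson approximation

$$\mathrm{Cov}\bigl(X_{(i)}, X_{(j)}\bigr) \approx \frac{p_i q_j}{(n+2)\,\varphi\bigl(\Phi^{-1}(p_i)\bigr)\,\varphi\bigl(\Phi^{-1}(p_j)\bigr)}, \qquad i \le j,$$

with $p_i = i/(n+1)$ and $q_j = 1 - p_j$; T2 and T3 are the higher-order corrections in $1/(n+2)$.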

Comparing to the high-precision (25dp) values obtained by Parrish (N = 20; log10.(abs.(differences))): [plot: log10 absolute differences from Parrish's values]

Davis-Stephens covariance matrix of normal order statistics

This method uses the David-Johnson formula + relations between correlations, etc:

function normalizerel!(v::AbstractVector, relative::Union{Vector{Int}, UnitRange})
    Sr = sum(v[relative])
    c = (1-Sr)/(sum(v) - Sr)
    for i in eachindex(v)
        if !(i in relative)
            v[i] *= c
        end
    end
    return v
end
function fillcov!(V, j::Int)
    # copy reversed j-th column to (end-j)-th column
    @views V[j:end-j+1, end-j+1] .= reverse(V[j:end-j+1, j])
    # copy j-th column to j-th row
    @views V[j, j+1:end-j] .= V[j+1:end-j, j]
    # copy (end-j)-th column to (end-j)-th row 
    @views V[end-j+1, j+1:end-j] .= V[j+1:end-j, end-j+1] 
    return V
end
function covDavisStephens(OS::NormOrderStatistic{T}, V11::T=var(OS, 1)) where T
    V = zeros(T, OS.n, OS.n)
    
    V[1,1] = V11
    V[2,1] = moment(OS, 1, 2) - expectation(OS,1)*expectation(OS,2) - T(1)
    
    for i in 3:OS.n
        V[i, 1] = covDavidJohnson(i, 1, OS.n)
    end
    normalizerel!(view(V, :, 1), [1,2])
    fillcov!(V, 1)
    j=1

    for j in 2:div(OS.n, 2)
        V[j,j] = var(OS, j)
        for i in j+1:OS.n-j+1
            V[i,j] = covDavidJohnson(i, j, OS.n)
        end
        normalizerel!(view(V, :, j), [collect(1:j); collect(OS.n-j+1:OS.n)])
        fillcov!(V, j)
    end
    
    if isodd(OS.n) # the middle point
        j = div(OS.n, 2)+1 
        V[j, j] = var(OS, j)
        normalizerel!(view(V, :, j), [j])
        fillcov!(V, j)
    end
    return Symmetric(V, :U)
end

covDavisStephens(N::Int) = covDavisStephens(NormOrderStatistic(N))
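The column normalization in normalizerel! relies on a standard identity: for an i.i.d. N(0,1) sample the mean $\bar X$ is independent of the deviations from it, so $\mathrm{Cov}(X_{(i)}, \bar X) = \mathrm{Var}(\bar X) = 1/n$ and hence

$$\sum_{j=1}^{n} \mathrm{Cov}\bigl(X_{(i)}, X_{(j)}\bigr) = \mathrm{Cov}\bigl(X_{(i)}, n\bar X\bigr) = 1,$$

i.e. every column of V must sum to 1; normalizerel! rescales the approximated entries to enforce this while keeping the already-known ones fixed.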

Again comparing against the high-precision values of Parrish (N = 20; log10.(abs.(differences))): [plot: log10 absolute differences from Parrish's values]

The formula is pretty quick:

N = 2000
@time M = covDavisStephens(N)
@time invM = inv(M);
heatmap(abs.(invM./maximum(invM)), size=(800,700), yflip=true)
# 2.853262 seconds (12.26 M allocations: 365.602 MiB, 2.97% gc time)
# 0.552725 seconds (22 allocations: 62.043 MiB, 1.02% gc time)

[heatmap: abs.(invM ./ maximum(invM)) for N = 2000]

Shapiro-Wilk coefficients

Finally the coefficients:

swcoeff(invcov, m) = ((m'*invcov)'/sqrt(m'*(invcov*invcov)*m))[1:div(length(m),2)]

Relative to the exact (=using high-precision covariance by Parrish) value of SW coefficients:

N=20
m = Float64.(expectation(NormOrderStatistic{BigFloat}(N)))
exact = -swcoeff(inv(Float64.(covd[N])), m)
p = plot(yaxis=:log10)
# plot!(exact, label="Exact (Parrish)")
plot!(abs.(SWCoeffs(N).A)./exact, label="Royston")
plot!(abs.(swcoeff(eye(N), expectation(Blom(N))))./exact, label="Weisberg-Bingham")
plot!(abs.(swcoeff(inv(covDavisStephens(N)), m))./exact, label="DavisStephens")

[plot: Royston, Weisberg-Bingham and DavisStephens coefficients relative to the exact ones]

I still have no idea what Royston fitted his coefficients against, but that must have been pretty accurate!

For N=2000, the third coefficient of Davis-Stephens is still larger than it should be by ~25% (relative to Royston): [plot]

kalmarek avatar Mar 02 '18 13:03 kalmarek

@kalmarek Sorry for dropping the ball here. What is the current status?

andreasnoack avatar Jun 05 '18 11:06 andreasnoack

I also had too much to do and couldn't finish this cleanly ;)

Long story short, we either:

  1. use the Royston approximation (already in place), forgetting that he fit it using 32 bits of precision (as everybody does)
  2. use Parrish's exact values of covariances (N≤25) to compute exact SW-coeffs and Royston otherwise (as everybody should do)
  3. try to get the order statistics going (and pre-compute the SW-coeffs, e.g. for N ≤ 50, or 100), and use Royston's method above;
    • I tried my own solution to compute exact covariances of normal order statistics here https://github.com/kalmarek/HypothesisTests.jl/tree/normordstats, but as noted above, there is some problem with precision loss which I was not able to track down;
    • use Nemo's high quality acb_integrate https://github.com/Nemocas/Nemo.jl/blob/master/src/arb/acb_calc.jl (was the next thing on my list, but never happened)

Happy to finish this, starting next week, actually (conferences, etc.)

kalmarek avatar Jun 17 '18 21:06 kalmarek

This dedicated work on the Shapiro-Wilk test is exceptionally fantastic 🥇

Is there any news on completing the work on the SWT?

Ph0non avatar Nov 15 '18 12:11 Ph0non

@Ph0non this effort (it was fun!) is an offspring of my frustration with the teaching I was forced to do. And yes, that included teaching statistics ;-) That semester is over, so I don't have enough frustration in store right now to forge it into code ;-)

But if you are willing to provide documentation, I may look into finishing this, but only in December (conferences abound).

kalmarek avatar Nov 15 '18 16:11 kalmarek

Do you need documentation like the other tests in the package, or something more (with more theoretical background or so)?

Ph0non avatar Nov 16 '18 17:11 Ph0non

There are four things to be finished here:

  1. fix problems with accuracy
  2. find out what Royston fit his approximation to
  3. produce test suite
  4. write documentation

I volunteer to finish 1 (and maybe 2). If I succeed, then 3 should be rather easy, but not a zero-work task. In the documentation (4) we should (among others) carefully explain:

  • why do we differ from standard packages (everybody out there is using Royston),
  • how do we obtain our SW coeffs
  • why everybody should be using exact order statistics in SW (at least for N≤50)

kalmarek avatar Nov 17 '18 22:11 kalmarek

Some small fixes https://github.com/kalmarek/HypothesisTests.jl/pull/1

Ph0non avatar Nov 20 '18 06:11 Ph0non

Hi, thanks for the work done so far.

Have you checked the Python implementation? It could be helpful; it would allow cross-checking results as well as coefficients.

I cannot remember right now, by heart, how the coefficients are computed in the Python version, but I believe they are for double precision and are not covered by any copyright. It could be helpful to cross check this info.

I had also implemented a version in Fortran at some point; it could come in handy.

albz avatar Nov 28 '18 12:11 albz

@albz could you point me to a Python implementation? The one from scipy https://github.com/scipy/scipy/blob/v1.1.0/scipy/stats/morestats.py#L1304 uses the standard Fortran code; the approximating polynomials were not fit in double precision there.

Right now copyright is not a problem, since I wrote everything from scratch following the paper only.

kalmarek avatar Nov 28 '18 12:11 kalmarek

I see that people are generally interested, so here are some details. In the 1992 paper Royston:

  1. obtained (from 10000 MC samples) what he calls "exact" coefficients (A),
  2. took the Weisberg-Bingham approximation (C), and
  3. fit two quintic polynomials (in n^{1/2}) to the differences A_n - C_n and A_{n-1} - C_{n-1}. Coefficients of those polynomials are visible in the Fortran code as C1 and C2;
  4. his coefficients for SW consist of the corrected C_n and C_{n-1} and the re-normalized rest of C (see the sketch below).
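In code, the correction scheme of points 2-4 would look roughly like this (illustrative only, not the actual swilk.f logic; C is the normalized Weisberg-Bingham vector and δn, δn1 are the two polynomial corrections already evaluated at the given sample size):

function corrected_coeffs(C::Vector{Float64}, δn, δn1)
    n = length(C)
    a_n, a_n1 = C[n] + δn, C[n-1] + δn1        # corrected last two coefficients
    # rescale the interior entries so that the whole vector keeps unit norm
    φ = sqrt((1 - 2a_n^2 - 2a_n1^2) / (1 - 2C[n]^2 - 2C[n-1]^2))
    A = φ .* C
    A[n], A[n-1] = a_n, a_n1
    A[1], A[2]   = -a_n, -a_n1                 # antisymmetric counterparts
    return A
end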

I'm happy with both options here:

  • providing pre-computed (but this time exact!) coefficients A for sizes up to e.g. 200, and then
  • comparing the exact coefficients (again) with Weisberg-Bingham and using Royston's method for approximating the rest, as above.

Note that my experiments with MC show that it cannot provide better precision than 1e-5 or 1e-6.

Then additional fitting must be provided for the distribution of the W-statistic. Royston claims it's close to a ~lognormal, with two different sets of parameters, which need to be refitted again (these are C3-C7 in the Fortran code).

kalmarek avatar Nov 28 '18 13:11 kalmarek

You are right, the Python/Fortran code is in single precision. That is of little help.

albz avatar Nov 29 '18 00:11 albz

@kalmarek It's really a shame that this never was completed given all the work you put into this. So hereby my encouragement to resume the work.

andreasnoack avatar Aug 20 '20 07:08 andreasnoack

@andreasnoack Thanks for the encouragement ;) However unlikely it is, the SW test is still in the back of my head. Hopefully we will soon be bringing Arblib.jl to a usable state, and then I could use its certified integration to re-compute the SW coefficients. The plan is just as it was before:

  • use exact SW coeffs for small sample sizes, say N≤200 (using Arblib)
  • use the Royston approximation afterwards (this is already implemented), though the SW test becomes pretty useless for large sample sizes :)
  • re-fit polynomials (logpolynomials?) to distribution(s) of W-statistics and find critical thresholds (TODO)

kalmarek avatar Aug 20 '20 09:08 kalmarek

@andreasnoack in https://github.com/kalmarek/ShapiroWilk/pull/1 SWCoeffs are implemented using Arblib; one can do an (embarrassingly parallel) computation, e.g. for (n=50, prec=128), which gives ~7 digits of precision in a couple of minutes; for prec=256 one gets more digits than Float64 has.

To compute critical/p-values I'd need to

  1. approximate a normalizing transform (1-Wₙ)^t ~ N(µ, σ) (µ could be computed analytically, I'm not sure about σ) via MC
  2. fit a (exponential? biexponential?, inverse log?) function to t = t(n) and use it for all sample sizes
  3. use z-scores for the transformed W-statistic

I used off-the-shelf Optim.jl methods to do so:

using Statistics
using Random

using Distributions
using LinearAlgebra
using Optim


using Revise
using ShapiroWilk


function sample_Wstatistic(sw::ShapiroWilk.SWCoeffs{T}, sample_size) where T
    Wstats = zeros(T, sample_size)
    tmp = similar(Wstats, length(sw))
    for i in 1:sample_size
        randn!(tmp)
        sort!(tmp)
        Wstats[i] = ShapiroWilk.Wstatistic(tmp, sw)
    end
    Wstats
end;

function normal_power_fit(
    t,
    sample,
    qs = 0.005:0.005:0.995,
    normal_qs=quantile.(Normal(0,1), qs),
)

    sample_t = similar(sample)

    for idx in 1:length(sample_t)
        sample_t[idx] = (one(eltype(sample)) - sample[idx])^t
    end

    µ, σ = mean(sample_t), std(sample_t)
    for idx in 1:length(sample_t)
        sample_t[idx] = (sample_t[idx] - μ)/σ
    end

    #@assert issorted(sample_t)

    sample_t_qs = quantile(sample_t, qs, sorted=true)

    for idx in 1:length(sample_t_qs)
        sample_t_qs[idx] = sample_t_qs[idx] - normal_qs[idx]
    end

    return norm(sample_t_qs, 2)
end

optionally e.g.

@time let n = 30, prec=128
    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)
    @time ShapiroWilk.OrderStatisticsArblib._precompute_ψ(n, prec=prec, R=18.0)
end

then

exponents = map(5:30) do n
    prec=128
    qs=0.005:0.005:0.995

    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)
    @info "Computing coefficients..."
    @time swfl = ShapiroWilk.SWCoeffs(n, Float64.(ShapiroWilk.SWCoeffs(OS)))

    @info "Sampling W-statistic..."
    @time W_sample = sort!(sample_Wstatistic(swfl, 10_000_000), rev=true)

    @info "Fitting power-transform for normality..."
    @time res = let qs=qs
        normal_qs=quantile.(Normal(0,1), qs)

        optimize(
            x->normal_power_fit(first(x), W_sample, qs, normal_qs),
            0.01,
            0.5,
            Brent(),
        )
    end
    @info "For n=$n optimization yielded" res

    Optim.minimizer(res)
end

yields the normalizing exponents, i.e. τ such that (1-W)^τ ~ N(µ, σ). Fitting τ=τ(n) is then straightforward, e.g.

┌ Info: For n=30 optimization yielded
│   res =
│    Results of Optimization Algorithm
│     * Algorithm: Brent's Method
│     * Search Interval: [0.010000, 0.500000]
│     * Minimizer: 5.386236e-02
│     * Minimum: 6.233191e-02
│     * Iterations: 12
│     * Convergence: max(|x - x_upper|, |x - x_lower|) <= 2*(1.5e-08*|x|+2.2e-16): true
└     * Objective Function Calls: 13

The whole table of these τs looks as follows:

[plot: fitted normalizing exponents τ for n = 5:30]

kalmarek avatar Jan 09 '21 12:01 kalmarek

Was just looking into Shapiro-Wilk, am I reading it right that it's pretty close?

BioTurboNick avatar Jan 18 '21 07:01 BioTurboNick

@BioTurboNick You can already use the code from this branch directly (well, the only thing that doesn't work is censored samples) for an implementation based on Royston's approximation of the S-W test (updating to the modern HypothesisTests will require a bit of coding, but not much).

What is (relatively) close is a new approximation that I'm doing in my spare time ;) Whether it will make a difference, I don't know, but it will be much closer to "mathematical correctness" (whatever that means:).


EDIT: just to let you know: everybody else (R, scipy) is using the same old Fortran implementation, which will be comparable to what is in this branch.

kalmarek avatar Jan 18 '21 10:01 kalmarek

@andreasnoack or anyone else in this thread ;) I'd need your advice.

I have the SW-coeffs computed up to eps(Float64) precision for n=3:189 (I could go further, but I wasn't patient enough ;).

What is left are the critical levels for the W-statistic, i.e. wstatcdf. Royston/Parrish use different normalizing transforms (functions that take W to Z ~ N(0,1)):

  • for n∈4:11, approximate/interpolate the pdf/cdf of W=W_n using MC. Royston in 1993 suggests -ln(γ-ln(1-W)) as the normalizing transform, but this seems very crude, given that the pdf seems non-differentiable. I'm not even sure this is taken into account in the existing Shapiro-Wilk implementations.
  • find a normalizing transform for n∈12:2000 and estimate its parameters based on MC for n∈12:189. Here there are two candidates: W → (ln(1-W) - µ)/σ (Royston) or W → ((1-W)^t - µ)/σ (Parrish).

After looking closely at MC samples I arrived at this:

  • n = 3 → analytic solution
  • n = 4:6 → MC + piecewise fit ??
  • n = 7:28 → Parrish's power transform (1-W)^t ~ N(µₙ,σₙ): we need to store (tₙ, µₙ, σₙ)
  • n >= 29 → Royston's lognormal log(1-W) ~ N(µ′ₙ, σ′ₙ): fit biexps/some other models to µ′ and σ′ on n = 29:189 and use them for all n (a sketch of the whole piecewise scheme is below)

(The transition Parrish → Royston is determined based on the MSE over ~200 uniformly distributed quantiles.)
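In code, the piecewise scheme above would look roughly like this (a sketch only; analytic_pvalue, interpolated_pvalue, parrish_params and royston_params are hypothetical placeholders for the pieces described above):

using Distributions

function sw_pvalue(W, n)
    if n == 3                          # analytic solution
        return analytic_pvalue(W)
    elseif 4 <= n <= 6                 # MC + piecewise fit
        return interpolated_pvalue(W, n)
    elseif 7 <= n <= 28                # Parrish's power transform
        t, µ, σ = parrish_params(n)    # stored (tₙ, µₙ, σₙ)
        return ccdf(Normal(), ((1 - W)^t - µ) / σ)
    else                               # Royston's lognormal transform
        µ, σ = royston_params(n)       # fitted models in n
        return ccdf(Normal(), (log(1 - W) - µ) / σ)
    end
end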

What do you think?

I asked about the shape of the W-distributions on discourse: https://discourse.julialang.org/t/can-anyone-tell-me-what-is-this-distribution/53726 but got no input ;)

kalmarek avatar Feb 26 '21 15:02 kalmarek

OK, it seems that I was fitting the normal distribution wrong ;) The Box-Cox transform W → (1-W)^λ/λ for n in 7:189 seems superior to the shifted logarithm when optimizing the log-likelihood (currently using BoxCoxTrans.jl).

Using 1_000_000 MC samples of the W-statistic taken from the normal distribution (for each n∈7:189), I computed λ (the exponent of the Box-Cox transform) and µ, σ (the mean and std of the transformed sample of W) as functions of n, using two models:

  • inverse(t, a₀, a₁, γ) = a₀ + a₁/(t - γ)
  • logarithmic(t, a₀, a₁, γ) = a₀ + a₁*log(t - γ)

The models (green) were fit using least squares for n=7:140 (blue), and you can see how well they fit the "test" data (in red). The bottom plot shows the normalized residuals with their mean and std.

[plots: fitted models for λ, µ and σ over n = 7:140; attachments λ_7:140.pdf, µ_7:140.pdf, σ_7:140.pdf]
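Fitting either model is a plain least-squares problem; a sketch with LsqFit (here ns stands for the sample sizes 7:140 and λs for the corresponding Box-Cox exponents obtained from the MC runs):

using LsqFit

# the inverse model from above: a₀ + a₁/(t - γ), with p = [a₀, a₁, γ]
inverse_model(t, p) = p[1] .+ p[2] ./ (t .- p[3])

fit = curve_fit(inverse_model, ns, λs, [0.0, 1.0, 0.0])  # [a₀, a₁, γ] starting guess
a₀, a₁, γ = coef(fit)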

@andreasnoack As I really have very little knowledge about statistics / model fitting / parameter estimation, I need someone knowledgeable to have a look at this and provide a stamp of approval ;)

kalmarek avatar Mar 02 '21 14:03 kalmarek

@kalmarek @andreasnoack just curious if this is still being looked into/implemented

jarredclloyd avatar Dec 09 '22 00:12 jarredclloyd

@jarredclloyd this kind of stalled because I was not able to produce meaningful fits (except eyeballing plots and saying: this looks seriously gooood...). If you're willing to help (and have more experience with this), be my guest.

kalmarek avatar Dec 09 '22 14:12 kalmarek

Definitely willing to help, but I think the specific problem you are trying to overcome might be beyond my level of knowledge currently. Correct me if I'm wrong, but the implementation here has a working version consistent with other languages? I think it would be great to have this merged into HypothesisTests so that Julia has a functional Shapiro-Wilk test, especially given all the fantastic work you've done on this. From there we could attempt to resolve the fitting issue (I can ask some of my colleagues who might be able to help). So, what needs to be completed to have a functional version of Shapiro-Wilk that can be merged into the main branch?

jarredclloyd avatar Dec 09 '22 23:12 jarredclloyd

@jarredclloyd an old todo is here https://github.com/JuliaStats/HypothesisTests.jl/pull/124#issuecomment-439651770;

Minimal set:

  • [ ] produce more tests (optional)
  • [x] write documentation (what is implemented follows the paper of Royston precisely)
  • [x] Resolve the conflict (trivial) + update to the newest version of the HT.

Can you try 3) ?

Extended set:

  1. perform computations for accurate SWCoeffs (I computed those for n≤400, done)
  2. extrapolate these for larger n by finding corrections to Weisberg-Bingham method (for a few first + normalizing the rest)
  3. fit a model of the normalizing transform for all n based on the first n=400
  4. produce documentation for this.

Others:

  1. implement confint ?

kalmarek avatar Dec 11 '22 01:12 kalmarek

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Files | Coverage | Δ
src/HypothesisTests.jl | 77.08% <ø> | -1.49% ↓
src/shapiro_wilk.jl | 97.33% <97.33%> | (ø)

... and 27 files with indirect coverage changes


codecov-commenter avatar Dec 11 '22 19:12 codecov-commenter