HypothesisTests.jl
Shapiro-Wilk normality test
Implements ShapiroWilkTest following
Patrick Royston, Approximating the Shapiro-Wilk W-test for non-normality, Statistics and Computing (1992) 2, 117-119, DOI: 10.1007/BF01891203.
This is work in progress; all comments are welcome! I tried to follow the paper closely, but I copied e.g. the precomputed constants from the original swilk.f. Please let me know if you are OK (license-wise) with this.
These polynomial interpolations are computed at load time for speed. Currently:
julia> using BenchmarkTools
julia> srand(1);
julia> k = 5000;
julia> X = sort(randn(k));
julia> @btime ShapiroWilkTest(X);
112.523 μs (18 allocations: 137.45 KiB)
julia> swc = SWCoeffs(k);
julia> @btime ShapiroWilkTest(X, swc);
62.476 μs (11 allocations: 78.50 KiB)
whereas calling swilkfort directly:
julia> srand(1);
julia> X = sort(randn(5000));
julia> A = zeros(2500);
julia> @btime swilkfort!(X, A);
205.763 μs (10 allocations: 224 bytes)
julia> @btime swilkfort!(X, A, false);
75.692 μs (10 allocations: 224 bytes)
Still missing:
- [x] documentation
- [ ] censored data
- [ ] polynomial interpolation with better precision (?),
and probably more.
I tried to compute exact values of SWCoeffs (via Monte Carlo simulation), but the results I'm getting are off from the ones reported in Table 1 op. cit. I would be glad if anyone could help.
function approximate_A(N, samps)
    s = [sort(randn(N)) for i in 1:samps]
    m = sum(s[i][1:N] for i in 1:samps)/samps
    ss = vcat(s...)
    Vinv = inv(cov(ss, ss))
    A = (-m'*Vinv/sqrt(m'*Vinv*Vinv*m))'
    return ((A - reverse(A))/2)[1:div(N,2)]
end;
approximate_A(10,1000000)
results in
[0.5469493640066541, 0.3559315327467397, 0.23322491989287789, 0.1336531031109471, 0.043609401791162974]
Compare swilk.f's and SWCoeffs(10).A:
[0.5737146934337275, 0.3289700603561813, 0.21434902886647542, 0.12279063037794058, 0.04008871216291867]
[0.5737147066903874, 0.3289700464878161, 0.21434901803439887, 0.1227906248657577, 0.04008871105102476]
Rahman & Govindarajulu
A different implementation of SWCoeffs following
M. Mahibbur Rahman & Z. Govindarajulu,
A modification of the test of Shapiro and Wilk for normality
Journal of Applied Statistics 24:2, 219-236, 1997
DOI: 10.1080/02664769723828
function HypothesisTests.SWCoeffs(N::Int, ::Type{Val{:Rahman}})
    M = [norminvcdf(i/(N+1)) for i in 1:N]
    normM = normpdf.(M)
    c = (N+1)*(N+2)
    d = 2c.*normM.^2
    dl = [-c*normM[i]*normM[i+1] for i in 1:N-1]
    Vinv = SymTridiagonal(d, dl)
    C = sqrt(M'*(Vinv*Vinv)*M)
    Astar = -(M'*Vinv)'/C
    return SWCoeffs(N, Astar[1:div(N,2)])
end
But they don't provide critical values for their statistic (or I couldn't find them), so that would require further simulation. Also, their SWCoeffs(10, Val{:Rahman}) are quite different:
[0.6010309510135708, 0.2975089851946684, 0.19212508233094944, 0.10979215067183574, 0.03583065785524651]
On the other hand, ALGORITHM AS R94 is a standard.
Thanks for contributing this. Your implementation looks good. The main issue is that swilk.f is written for Float32 and not Float64, so you'll probably need to compute new polynomial coefficients for the Float64 implementation.
Another issue is the license. http://lib.stat.cmu.edu/apstat/ states that
The Royal Statistical Society holds the copyright to these routines, but has given its permission for their distribution provided that no fee is charged.
which is not compatible with the MIT license, so either you should try to implement these directly from the paper or contact the Royal Statistical Society and ask if they have changed their mind regarding that clause. The best would be if they could use a real license.
The main issue is that swilk.f is written for Float32 and not Float64, so you'll probably need to compute new polynomial coefficients for the Float64 implementation.
Yep, I tried to do this, but failed to obtain the "true" SWCoeffs by semi-exact means.
Even obtaining the "exact" expected values of the order statistics E(r:n) (our vector m) seems to be a problem. Convention: we compute the first half of the (negated) order statistics of N(0,1).
A simple MC produces highly variable results:
function mMC(N::Int, samps=10^5)
    X = Array{Float64}(N)
    s = zeros(div(N,2))
    for i in 1:samps
        for j in 1:N
            X[j] = randn()
        end
        sort!(X)
        for j in 1:div(N,2)
            s[j] += X[j]
        end
    end
    return -s/samps
end
d = mMC(101, 10^7) - mMC(101, 10^7);
norm(d,1)
0.0019123699000122493
More or less exact order statistics can be obtained by direct integration, e.g. following Algorithm AS 177, Royston, J.P., Expected Normal Order Statistics (Exact and Approximate).
But I don't know how exact normcdf etc. and quadgk are, i.e. how exact is the following method?
using QuadGK    # quadgk
using StatsFuns # normcdf, normccdf

function Royston(n, r, logs=[log(i) for i in 1:n], o=7*sqrt(n))
    C = sum(logs) - sum(view(logs, 2:r-1)) - sum(view(logs, 2:n-r))
    I(x) = C + (r-1)*log(normccdf(x)) + (n-r)*log(normcdf(x)) - 1/2*(log(2π) + x^2)
    return quadgk(x->x*exp(I(x)), -9.0-log(n), 9.0+log(n),
                  order=floor(Int, o),
                  maxevals=floor(Int, log(n)*10^7))
end

function mRoyston(n, o=7*sqrt(n))
    logs = [log(i) for i in 1:n]
    m = [Royston(n, r, logs, o) for r in 1:div(n,2)]
    err = norm([i[2] for i in m], 1)
    @show err
    return [i[1] for i in m]
end
Royston's Algorithm AS R94 uses the Blom approximation:
Blom(n,r) = -norminvcdf((r - 3/8)/(n + 1/4))
mBlom(n) = [Blom(n,r) for r in 1:div(n,2)]
and Rahman (cited above) used the simpler
Rahman(n,r) = -norminvcdf(r/(n + 1))
mRahman(n) = [Rahman(n,r) for r in 1:div(n,2)]
Comparing
I believe the direct integration to be the most accurate, so I'd lean towards regarding it as "true". Comparing other methods against it gives
const N = 5000
m_Ro = mRoyston(N);
m_mc = mMC(N, 10^6);
m_Bl = mBlom(N);
m_Rh = mRahman(N);
p = plot(yaxis=:log10, title="Relative to Royston",size=(800,600))
for (X, l) in [(m_mc, "MonteCarlo"),
               (m_Bl, "Blom approximation"),
               (m_Rh, "Rahman approximation")]
    plot!(abs.((X-m_Ro)./m_Ro), label=l)
end
p
Questions:
- Is storing/caching values of m (and later of SWCoeffs) an option (for sizes N in 4:1:5000)?
- Should we follow AS R94 for the ShapiroWilk test, or should we fit our own curves for the distribution of W?
But I don't know how exact normcdf etc. and quadgk are, i.e. how exact is the following method?
Both should be of high quality, and quadgk can provide an estimate of the error. Notice also that there is normlogcdf, which might give you a little extra accuracy over the explicit log.
So all-in-all I think this should be fine for computing the mean order statistics. Have you tried to compute the covariances yet? They are needed for fitting the polynomials, right?
- Is storing/caching values of m (and later of SWCoeffs) an option (for sizes N in 4:1:5000)?
I guess we could do that. We could also try to do this as part of a generated function that would compute everything on demand but it might make the first call pretty slow. It would be pretty cool on the other hand.
Should we follow AS R94 for the ShapiroWilk test, or should we fit our own curves for distributions of W?
Given how inaccurately they computed the order statistics back then, I guess it would be good if we could recompute them.
Another thing, do you have experience with the Weisberg-Bingham version? It is so much easier to compute, so do you know why people seem to prefer SW? Path dependency? I tried to look for power comparisons but couldn't find anything that suggested it to be significantly worse.
For covariances I plan to do the following: [DavisStephens1977] gives an algorithm to compute the covariance matrix of normal order statistics given
- $V_{11}$ - the variance of the largest order statistic → approximated by [SheaScallon1988]
- $EX_1$, $EX_2$ - the expected values of the largest and second largest order statistics → ~exact by [Royston1982]
- $\Sigma^2$ - the sum of squares of the expected order statistics → approximated (with bounds on the error) by [Balakrishnan1984]

- [DavisStephens1977]: C. S. Davis & M. A. Stephens, The covariance matrix of normal order statistics, Communications in Statistics - Simulation and Computation, Vol. 6, Iss. 1, 1977, DOI:10.1080/03610917708812028
- [SheaScallon1988]: B. L. Shea and A. J. Scallon, Remark AS R72: A Remark on Algorithm AS 128. Approximating the Covariance Matrix of Normal Order Statistics, Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 37, No. 1 (1988), pp. 151-155, DOI:10.2307/2347514
- [Royston1982]: J. P. Royston, Algorithm AS 177: Expected Normal Order Statistics (Exact and Approximate), Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 31, No. 2 (1982), pp. 161-165, DOI:10.2307/2347982
- [Balakrishnan1984]: N. Balakrishnan, Algorithm AS 200: Approximating the Sum of Squares of Normal Scores, Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 33, No. 2 (1984), pp. 242-245, DOI:10.2307/2347458
But I may not have enough time to do this during this week.
We could also try to do this as part of a generated function that would compute everything on demand but it might make the first call pretty slow. It would be pretty cool on the other hand.
We may have a look at speeds and decide when I finish the plan above; but I guess a cached (or @memoized?) version of SWCoeff(N::Int) will be the way to go.
Another thing, do you have experience with the Weisberg-Bingham version? It is so much easier to compute, so do you know why people seem to prefer SW? Path dependency? I tried to look for power comparisons but couldn't find anything that suggested it to be significantly worse.
Royston's AS R94 is essentially Weisberg-Bingham + corrections + fitting the distribution of the W-statistic. AS R94 uses Blom's expectation of normal order statistics m and then corrects just the first two coefficients.
This is an approximation of (the perfect) Shapiro-Wilk, if you like. If we can (on modern hardware) do it the "exact" way (and make it fast), maybe we should?
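For reference, the Weisberg-Bingham coefficients themselves are just normalized Blom scores; a minimal sketch (assuming norminvcdf from StatsFuns, and not tied to any particular SWCoeffs type):

```julia
using StatsFuns: norminvcdf

# Weisberg-Bingham: take Blom's approximate expected normal order statistics
# and normalize them to unit length (implicitly treating the covariance as identity).
function weisberg_bingham_coeffs(n::Int)
    m = [norminvcdf((i - 3/8)/(n + 1/4)) for i in 1:n]  # Blom scores
    return m ./ sqrt(sum(abs2, m))
end
```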
Another thing, do you have experience with Weisberg-Bingham version?
As a matter of fact, I'm not a statistician; I do pure math and code in Julia "after hours" ;-). But for most of my "empirical science" friends SW was the one to go to, and it seems to have the best power for small samples.
Ok, variances/covariances turned out harder than I thought:
I implemented the exact formulas from Godwin, Some Low Moments of Order Statistics, in normordstats.jl, please have a look. These become numerically unstable for n ~ 25, but for n=20 they agree to ~10 digits with Table 1 of Parrish. He used 128 bits of precision to compute tables of products of order statistics up to n=50 (and I guess he didn't progress any further because of loss of precision). So maybe my cheap implementation isn't that bad ;-)
(BTW, do you have, by any chance, access to a paper copy of Parrish's article (to scan and send me)? The digital copy I have has no text layer and is of such poor quality that OCRing those tables produces errors in every row. Which, for an article to BORROW for 24h for $50, is OUTRAGEOUS.)
With these I can tell that Royston's approximation of SWCoeffs is surprisingly good, whereas Rahman's is (surprisingly) bad? Maybe there is an error in my implementation of the latter:
blom(n,i) = norminvcdf((i - 3/8)/(n + 1/4))
rahman(n,i) = norminvcdf(i/(n + 1))
struct Blom
    n::Int
end
struct Rahman
    n::Int
end
OrderStatistics.expectation(B::Blom) = [blom(B.n,i) for i in 1:B.n]
OrderStatistics.expectation(R::Rahman) = [rahman(R.n,i) for i in 1:R.n]

# M. Mahibbur Rahman & Z. Govindarajulu,
# A modification of the test of Shapiro and Wilk for normality
# *Journal of Applied Statistics* 24:2, 219-236, 1997
# DOI: [10.1080/02664769723828](http://dx.doi.org/10.1080/02664769723828)
function invV(N, M)
    pdfM = normpdf.(M)
    d = pdfM.^2
    dl = [-pdfM[i]*pdfM[i+1] for i in 1:N-1]
    return (N+1)*(N+2)*SymTridiagonal(d, dl)
end

function HypothesisTests.SWCoeffs(R::Rahman)
    M = expectation(R)
    A = invV(R.n, M)
    C = sqrt(M'*(A*A)*M)
    Astar = (M'*A)'/C
    return SWCoeffs(R.n, Astar[1:div(R.n,2)])
end

function HypothesisTests.SWCoeffs(OS::NormOrderStatistic)
    m = expectation(OS)
    iV = inv(cov(OS))
    A = m'*iV./sqrt(m'*iV*iV*m)
    return SWCoeffs(OS.n, A[1,1:div(OS.n,2)])
end
Then:
N = 25
@time ExactA = SWCoeffs(NormOrderStatistic(N));
plot(SWCoeffs(N).A, label="Blom + 0-correlation OrdStats")
plot!(SWCoeffs(Rahman(N)).A, label="Rahman + limit inverse correlation")
plot!(-ExactA.A, label="Exact formulas, BigFloat")
(You can see instability kicking in)
Right now I cannot think of anything better than the approximation by Davis & Stephens mentioned above (using a Taylor expansion of the expectation of products of order statistics).
Using BigFloats, ArbFloats, etc. will not help much, as the precision of e.g. erf, and hence normcdf, is limited to Float64, at least for now.
It took more time than I had predicted, but turned out a fun project ;-)
I'm looking for advice on what to do with the code, based on the results below. At the moment I think that numerically the Shapiro-Wilk test should be left "as is", until someone can point out what Royston fit his polynomials against. We should of course use the exact values of Parrish for N≤20 (or even N≤50 if I am able to fix the loss of precision mentioned below).
Normal order statistics
For implementation see https://github.com/kalmarek/HypothesisTests.jl/tree/normordstats
Using this code you can confirm the covariances computed by Parrish (the 25dp values) for N≤20, but beyond that point there seems to be a precision loss which I was not able to trace.
David-Johnson formula:
function diffG(x)
    s = normpdf(x)
    s² = s^2
    s⁴ = s²*s²
    x² = x^2
    x⁴ = x²*x²
    return (
        1/s,
        x/s²,
        (1 + 2x²)/(s*s²),
        x*(7 + 6x²)/s⁴,
        (7 + 46x² + 24x⁴)/(s*s⁴),
        x*(127 + 326x² + 120x⁴)/(s²*s⁴),
    )
end
function covDavidJohnson(i::Int, j::Int, n::Int, order=3)
    if j < i
        return covDavidJohnson(j, i, n, order)
    end
    pᵢ, pⱼ = i/(n+1), j/(n+1)
    qᵢ, qⱼ = 1 - pᵢ, 1 - pⱼ
    Gᵢ = diffG(norminvcdf(pᵢ))
    Gⱼ = diffG(norminvcdf(pⱼ))
    X = 1/(n+2)
    T = Vector{eltype(Gᵢ)}()
    if order >= 1
        T1 = X*pᵢ*qⱼ*Gᵢ[1]*Gⱼ[1]
        push!(T, T1)
    end
    if order >= 2
        T2 = X^2*pᵢ*qⱼ*(
            (qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] +
            (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] +
            0.5(
                pᵢ*qᵢ*Gᵢ[3]*Gⱼ[1] +
                pⱼ*qⱼ*Gᵢ[1]*Gⱼ[3] +
                pᵢ*pⱼ*Gᵢ[2]*Gⱼ[2]
            )
        )
        push!(T, T2)
    end
    if order >= 3
        T3 = X^3*pᵢ*qⱼ*(
            -(qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] - (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] +
            ((qᵢ-pᵢ)^2 - pᵢ*qᵢ)*Gᵢ[3]*Gⱼ[1] +
            ((qⱼ-pⱼ)^2 - pⱼ*qⱼ)*Gᵢ[1]*Gⱼ[3] +
            1/2*(3(qᵢ-pᵢ)*(qⱼ-pⱼ) + pⱼ*qᵢ - 4pᵢ*qⱼ)*Gᵢ[2]*Gⱼ[2] +
            5/6*pᵢ*qᵢ*(qᵢ-pᵢ)*Gᵢ[4]*Gⱼ[1] +
            5/6*pⱼ*qⱼ*(qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[4] +
            (pᵢ*qⱼ*(qᵢ-pᵢ) + 0.5pᵢ*qᵢ*(qⱼ-pⱼ))*Gᵢ[3]*Gⱼ[2] +
            (pᵢ*qⱼ*(qⱼ-pⱼ) + 0.5pⱼ*qⱼ*(qᵢ-pᵢ))*Gᵢ[2]*Gⱼ[3] +
            1/8*(pᵢ^2*qᵢ^2*Gᵢ[5]*Gⱼ[1] + pⱼ^2*qⱼ^2*Gᵢ[1]*Gⱼ[5]) +
            1/4*(pᵢ^2*qᵢ*qⱼ*Gᵢ[4]*Gⱼ[2] + pᵢ*pⱼ*qⱼ^2*Gᵢ[2]*Gⱼ[4]) +
            1/12*(2pᵢ^2*qⱼ^2 + 3pᵢ*pⱼ*qᵢ*qⱼ)*Gᵢ[3]*Gⱼ[3]
        )
        push!(T, T3)
    end
    return sum(T)
end
Comparing to the high-precision (25d) values obtained by Parrish (N = 20; log10.(abs.(differences))):
Davis-Stephens covariance matrix of normal order statistics
This method uses the David-Johnson formula + relations between the correlations, etc.:
function normalizerel!(v::AbstractVector, relative::Union{Vector{Int}, UnitRange})
    Sr = sum(v[relative])
    c = (1-Sr)/(sum(v) - Sr)
    for i in eachindex(v)
        if !(i in relative)
            v[i] *= c
        end
    end
    return v
end

function fillcov!(V, j::Int)
    # copy reversed j-th column to (end-j)-th column
    @views V[j:end-j+1, end-j+1] .= reverse(V[j:end-j+1, j])
    # copy j-th column to j-th row
    @views V[j, j+1:end-j] .= V[j+1:end-j, j]
    # copy (end-j)-th column to (end-j)-th row
    @views V[end-j+1, j+1:end-j] .= V[j+1:end-j, end-j+1]
    return V
end

function covDavisStephens(OS::NormOrderStatistic{T}, V11::T=var(OS, 1)) where T
    V = zeros(T, OS.n, OS.n)
    V[1,1] = V11
    V[2,1] = moment(OS, 1, 2) - expectation(OS,1)*expectation(OS,2) - T(1)
    for i in 3:OS.n
        V[i, 1] = covDavidJohnson(i, 1, OS.n)
    end
    normalizerel!(view(V, :, 1), [1,2])
    fillcov!(V, 1)

    for j in 2:div(OS.n, 2)
        V[j,j] = var(OS, j)
        for i in j+1:OS.n-j+1
            V[i,j] = covDavidJohnson(i, j, OS.n)
        end
        normalizerel!(view(V, :, j), [collect(1:j); collect(OS.n-j+1:OS.n)])
        fillcov!(V, j)
    end

    if isodd(OS.n) # the middle point
        j = div(OS.n, 2)+1
        V[j, j] = var(OS, j)
        normalizerel!(view(V, :, j), [j])
        fillcov!(V, j)
    end
    return Symmetric(V, :U)
end

covDavisStephens(N::Int) = covDavisStephens(NormOrderStatistic(N))
covDavisStephens(N::Int) = covDavisStephens(NormOrderStatistic(N))
Again comparing against the high-precision values of Parrish (N = 20; log10.(abs.(differences))):
The formula is pretty quick:
N = 2000
@time M = covDavisStephens(N)
@time invM = inv(M);
heatmap(abs.(invM./maximum(invM)), size=(800,700), yflip=true)
# 2.853262 seconds (12.26 M allocations: 365.602 MiB, 2.97% gc time)
# 0.552725 seconds (22 allocations: 62.043 MiB, 1.02% gc time)
Shapiro-Wilk coefficients
Finally the coefficients:
swcoeff(invcov, m) = ((m'*invcov)'/sqrt(m'*(invcov*invcov)*m))[1:div(length(m),2)]
Relative to the exact (=using high-precision covariance by Parrish) value of SW coefficients:
N=20
m = Float64.(expectation(NormOrderStatistic{BigFloat}(N)))
exact = -swcoeff(inv(Float64.(covd[N])), m)
p = plot(yaxis=:log10)
# plot!(exact, label="Exact (Parrish)")
plot!(abs.(SWCoeffs(N).A)./exact, label="Royston")
plot!(abs.(swcoeff(eye(N), expectation(Blom(N))))./exact, label="Weisberg-Bingham")
plot!(abs.(swcoeff(inv(covDavisStephens(N)), m))./exact, label="DavisStephens")
I still have no idea what Royston fitted his coefficients against, but it must have been pretty accurate! For N=2000, the third coefficient of Davis-Stephens is still larger than it should be by ~25% (relative to Royston):
@kalmarek Sorry for dropping the ball here. What is the current status?
I also had too much to do and couldn't finish this cleanly ;)
Long story short, we either:
- use the Royston approximation (already in place), forgetting that he fit using 32 bits of precision (as everybody does);
- use Parrish's exact values (N≤25) of the covariances to compute exact SW coefficients and Royston otherwise (as everybody should do);
- try to get the order statistics going (and pre-compute the SW coefficients, e.g. for N ≤ 50, or 100), and use the Royston method above:
  - I tried my own solution to compute exact covariances of normal order statistics here https://github.com/kalmarek/HypothesisTests.jl/tree/normordstats, but as noted above, there is some problem with precision loss which I was not able to track down;
  - use Nemo's high-quality acb_integrate https://github.com/Nemocas/Nemo.jl/blob/master/src/arb/acb_calc.jl (this was the next thing on my list, but never happened).

Happy to finish this, starting next week, actually (conferences, etc.)
This dedicated work on the Shapiro-Wilk test is exceptional, fantastic 🥇
Are there any news to complete the work on the SWT?
@Ph0non this effort (it was fun!) is an offspring of my frustration with the teaching I was forced to do. And yes, that included teaching statistics ;-) That semester is over, so I don't have enough frustration in store right now to forge it into code ;-)
But if you are willing to provide documentation, I may look into finishing this, though only in December (conferences abound).
Do you need documentation like the other Tests in the package or something more (with more theoretical background or so)?
There are four things to be finished here:
1. fix problems with accuracy
2. find out what Royston fit his approximation to
3. produce a test suite
4. write documentation

I volunteer to finish 1 (and maybe 2). If I succeed, then 3 should be rather easy, but not a zero-work task. In the documentation (4) we should (among others) carefully explain:
- why we differ from standard packages (everybody out there is using Royston),
- how we obtain our SW coefficients,
- why everybody should be using exact order statistics in SW (at least for N≤50).
Some small fixes https://github.com/kalmarek/HypothesisTests.jl/pull/1
Hi, thanks for the work done so far.
Have you checked the python implementation? It could be helpful; it would allow cross-checking the results as well as the coefficients.
I cannot remember right now, by heart, how the coefficients are computed in the python version, but I believe they are for double precision and are not covered by any copyright. It could be helpful to cross-check this info.
I had also implemented a version in the fortran fashion at some point; it could come in handy.
@albz could you point me to a python implementation? The one from scipy https://github.com/scipy/scipy/blob/v1.1.0/scipy/stats/morestats.py#L1304 uses the standard fortran code; the approximating polynomials were not fit in double precision there.
right now copyright is not a problem, since I wrote everything from scratch following the paper only.
I see that people are generally interested, so here are some details. In the 1992 paper Royston:
- obtained (by 10000 MC samples) what he calls "exact" coefficients (A),
- took the Weisberg-Bingham approximation (C), and
- fit two quintic polynomials (in u = n^(-1/2)) to the differences A_n - C_n and A_{n-1} - C_{n-1}; the coefficients of those polynomials are visible in the fortran code as C1 and C2;
- his coefficients for SW consist of the corrected C_n and C_{n-1} and the re-normalized rest of C (see the sketch below).
I'm happy with both options here:
- providing pre-computed (but this time exact!) coefficients A for sizes up to e.g. 200, and then
- comparing the exact coefficients (again) with Weisberg-Bingham and using Royston's method to approximate the rest, as above.

Note that my experiments with MC show that it cannot provide better precision than 1e-5 or 1e-6.
Then additional fitting must be provided for the distribution of the W-statistic. Royston claims it's close to a lognormal, with two different sets of parameters, which need to be refitted again (these are C3-C7 in the fortran code).
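For context, once those normalizing parameters are available the p-value step itself is tiny. A sketch (µ_royston and σ_royston are placeholders for the fitted curves in the sample size, i.e. what C3-C7 encode; they are not reproduced here):

```julia
using StatsFuns: normccdf

# Sketch of the lognormal normalization used for larger n:
# z = (log(1 - W) - µ(n))/σ(n), with the upper tail of the standard normal
# as the p-value (small W ⇒ large z ⇒ small p-value).
function sw_pvalue_sketch(W::Real, n::Int, µ_royston, σ_royston)
    z = (log(1 - W) - µ_royston(n))/σ_royston(n)
    return normccdf(z)
end
```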
You are right, the python-fortran code is in single precision. That is of little help.
@kalmarek It's really a shame that this never was completed given all the work you put into this. So hereby my encouragement to resume the work.
@andreasnoack Thanks for the encouragement ;) However unlikely it is, the SW test is still in the back of my head.
Hopefully we will soon bring Arblib.jl to a usable state, and then I could use its certified integration to re-compute the SW coefficients.
The plan is just as it was before:
- use exact SW coefficients for small sample sizes, say N≤200 (using Arblib),
- use the Royston approximation afterwards (this is implemented) (but the SW test becomes pretty useless for large sample sizes :),
- re-fit polynomials (log-polynomials?) to the distribution(s) of the W-statistic and find critical thresholds (TODO).
@andreasnoack in https://github.com/kalmarek/ShapiroWilk/pull/1 SWCoeffs are implemented using Arblib; one can do an (embarrassingly parallel) computation e.g. for (n=50, prec=128), which gives ~7 digits of precision in a couple of minutes; one gets more than Float64 has for prec=256.
To compute critical/p-values I'd need to
- approximate a normalizing transform (1-Wₙ)^t ~ N(µ, σ) via MC (µ could be computed analytically, I'm not sure about σ),
- fit a (exponential? biexponential? inverse log?) function to t = t(n) and use it for all sample sizes,
- use z-scores for the transformed W-statistic.
I used off-the-shelf Optim.jl methods to do so:
using Statistics
using Random
using Distributions
using LinearAlgebra
using Optim
using Revise
using ShapiroWilk
function sample_Wstatistic(sw::ShapiroWilk.SWCoeffs{T}, sample_size) where T
    Wstats = zeros(T, sample_size)
    tmp = similar(Wstats, length(sw))
    for i in 1:sample_size
        randn!(tmp)
        sort!(tmp)
        Wstats[i] = ShapiroWilk.Wstatistic(tmp, sw)
    end
    Wstats
end;

function normal_power_fit(
    t,
    sample,
    qs = 0.005:0.005:0.995,
    normal_qs = quantile.(Normal(0,1), qs),
)
    sample_t = similar(sample)
    for idx in 1:length(sample_t)
        sample_t[idx] = (one(eltype(sample)) - sample[idx])^t
    end

    µ, σ = mean(sample_t), std(sample_t)
    for idx in 1:length(sample_t)
        sample_t[idx] = (sample_t[idx] - μ)/σ
    end

    # @assert issorted(sample_t)
    sample_t_qs = quantile(sample_t, qs, sorted=true)
    for idx in 1:length(sample_t_qs)
        sample_t_qs[idx] = sample_t_qs[idx] - normal_qs[idx]
    end

    return norm(sample_t_qs, 2)
end
optionally e.g.
@time let n = 30, prec = 128
    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)
    @time ShapiroWilk.OrderStatisticsArblib._precompute_ψ(n, prec=prec, R=18.0)
end
then
exponents = map(5:30) do n
    prec = 128
    qs = 0.005:0.005:0.995
    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)

    @info "Computing coefficients..."
    @time swfl = ShapiroWilk.SWCoeffs(n, Float64.(ShapiroWilk.SWCoeffs(OS)))

    @info "Sampling W-statistic..."
    @time W_sample = sort!(sample_Wstatistic(swfl, 10_000_000), rev=true)

    @info "Fitting power-transform for normality..."
    @time res = let qs = qs
        normal_qs = quantile.(Normal(0,1), qs)
        optimize(
            x -> normal_power_fit(first(x), W_sample, qs, normal_qs),
            0.01,
            0.5,
            Brent(),
        )
    end

    @info "For n=$n optimization yielded" res
    Optim.minimizer(res)
end
yields the normalizing exponents, i.e. τ such that (1-W)^τ ~ N(µ, σ). Fitting τ = τ(n) is then straightforward; e.g. for n=30 the optimization yields
┌ Info: For n=30 optimization yielded
│ res =
│ Results of Optimization Algorithm
│ * Algorithm: Brent's Method
│ * Search Interval: [0.010000, 0.500000]
│ * Minimizer: 5.386236e-02
│ * Minimum: 6.233191e-02
│ * Iterations: 12
│ * Convergence: max(|x - x_upper|, |x - x_lower|) <= 2*(1.5e-08*|x|+2.2e-16): true
└ * Objective Function Calls: 13
The whole table of these τs looks as follows:
Was just looking into Shapiro-Wilk, am I reading it right that it's pretty close?
@BioTurboNick You can already use the code from this branch directly (well, the only thing that doesn't work is censored samples) for an implementation based on the Royston approximation of the S-W test (updating to modern HypothesisTests will require a bit of coding, but not much).
What is (relatively) close is a new approximation that I'm doing in my spare time ;) Whether it will make a difference, I don't know, but it will be much closer to "mathematical correctness" (whatever that means:).
EDIT: just to let you know: everybody else (R, scipy) is using the same old fortran implementation, which will be comparable to what is in this branch.
@andreasnoack or anyone else in this thread ;) I'd need your advice.
I have the SW coefficients computed to eps(Float64) precision for n=3:189 (I could go further, but I wasn't patient enough ;).
What is left is the critical levels for the W-statistic, i.e. wstatcdf. Royston/Parrish use different normalizing transforms (functions that take W to Z ~ N(0, 1)):
- for n∈4:11, approximate/interpolate the pdf/cdf of W=W_n using MC. Royston in 1993 suggests -ln(γ-ln(1-W)) as the normalizing transform, but this seems very crude, given that the pdf seems non-differentiable. I'm not even sure this is taken into account in the existing Shapiro-Wilk implementations.
- find a normalizing transform for n∈12:2000 and estimate its parameters based on MC for n∈12:189. Here there are two candidates: W → (ln(1-W) - µ)/σ (Royston) or W → ((1-W)^t - µ)/σ (Parrish).
After looking closely at MC samples I arrived at this:
- n = 3 → analytic solution
- n = 4:6 → MC + piecewise fit ??
- n = 7:28 → Parrish's power transform (1-W)^t ~ N(µₙ, σₙ): we need to store (tₙ, µₙ, σₙ)
- n >= 29 → Royston's lognormal log(1-W) → N(µ′ₙ, σ′ₙ): fit biexps/some other models to µ′ and σ′ on n = 29:189 and use them for all n.

(The transition Parrish → Royston is determined based on the MSE over ~200 uniformly distributed quantiles.)
What do you think?
I asked about the shape of the W-distributions on discourse: https://discourse.julialang.org/t/can-anyone-tell-me-what-is-this-distribution/53726 but got no input ;)
Ok, it seems that I was fitting the normal distribution wrong ;) The Box-Cox transform W → (1-W)^λ/λ for n in 7:189 seems superior to the shifted logarithm when optimizing the log-likelihood (currently using BoxCoxTrans.jl).
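For reference, the quantity being optimized is just the Box-Cox profile log-likelihood, which is simple enough to sketch without the package (a generic sketch for a positive sample y, e.g. y = 1 .- W_sample; this is not the BoxCoxTrans.jl API):

```julia
using Statistics

# Box-Cox transform; the constant shift by 1/λ does not affect normality.
boxcox(y, λ) = λ == 0 ? log.(y) : (y .^ λ .- 1) ./ λ

# Profile log-likelihood of λ assuming the transformed data are i.i.d. normal
# (additive constants dropped); λ is chosen to maximize this.
function boxcox_loglik(y::AbstractVector, λ::Real)
    z = boxcox(y, λ)
    n = length(y)
    return -n/2 * log(var(z, corrected=false)) + (λ - 1) * sum(log, y)
end
```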
Using 1_000_000 MC samples of the W-statistic drawn from the normal distribution (for each n∈7:189) I computed λ (the exponent of the Box-Cox transform), µ and σ (the mean and std of the transformed sample of W) as functions of n, using two candidate models:
- inverse(t, a₀, a₁, γ) = a₀ + a₁/(t - γ)
- logarithmic(t, a₀, a₁, γ) = a₀ + a₁*log(t - γ)

The models (green) were fit using least squares for n=7:140 (blue), and you can see how well they fit the "test" data (in red); see the fitting sketch below. The bottom plot shows the normalized residuals with their mean and std.
λ_7:140.pdf µ_7:140.pdf σ_7:140.pdf
@andreasnoack As I really have very little knowledge about statistics/model fitting/parameter estimation, I need someone knowledgeable to have a look at this and provide a stamp of approval ;)
@kalmarek @andreasnoack just curious if this is still being looked into/implemented
@jarredclloyd this kind of stalled because I was not able to produce meaningful fits (except eyeballing plots and saying: this looks seriously gooood...). If you're willing to help (and have more experience with this), be my guest.
Definitely willing to help, but I think the specific problem you are trying to overcome might be beyond my level of knowledge currently. Correct me if I'm wrong, but the implementation here has a working version consistent with other languages? I think it would be great to have this implemented into HypothesisTest so that Julia has a functional Shapiro-Wilk test, especially given all the fantastic work you've done on this. From there we could attempt to resolve the fitting issue (I can ask some of my colleagues who might be able to help). So, what needs to be completed to have a functional version of Shapiro-Wilk that can be merged into the main branch?
@jarredclloyd an old todo is here https://github.com/JuliaStats/HypothesisTests.jl/pull/124#issuecomment-439651770;
Minimal set:
- [ ] produce more tests (optional)
- [x] write documentation (what is implemented follows the paper of Royston precisely)
- [x] Resolve the conflict (trivial) + update to the newest version of the HT.
Can you try 3) ?
Extended set:
- perform computations for accurate SWCoeffs (I computed those for n≤400, done)
- extrapolate these for larger n by finding corrections to Weisberg-Bingham method (for a few first + normalizing the rest)
- fit a model of the normalizing transform for all n based on the first n=400
- produce documentation for this.
Others:
- implement confint?
Codecov Report
Attention: 2 lines in your changes are missing coverage. Please review.
| Files | Coverage Δ | |
|---|---|---|
| src/HypothesisTests.jl | 77.08% <ø> (-1.49%) | :arrow_down: |
| src/shapiro_wilk.jl | 97.33% <97.33%> (ø) | |

... and 27 files with indirect coverage changes