StatsBase.jl
corspearman when all the values are NaNs
It looks like an error in tiedrank:
julia> corspearman([1.,2.,3.,4.,5.,6.], [NaN, NaN, NaN, NaN, NaN, NaN])
1.0
julia> rank([NaN, NaN, NaN, NaN, NaN, NaN])
ERROR: MethodError: `rank` has no method matching rank(::Array{Float64,1})
julia> tiedrank([NaN, NaN, NaN, NaN, NaN, NaN])
6-element Array{Float64,1}:
1.0
2.0
3.0
4.0
5.0
6.0
At least we agree with R
julia> R"rank(c(NaN,NaN,NaN,NaN,NaN,NaN))"
RCall.RObject{RCall.RealSxp}
[1] 1 2 3 4 5 6
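For what it's worth, the 1 to 6 result is what one would expect if ties are detected with ==, because NaN == NaN is false and so no two NaNs are ever grouped into a tie. Below is a minimal sketch of that logic. naive_tiedrank is a hypothetical helper written for illustration, not the actual StatsBase code, and it assumes the default sortperm ordering, which places NaNs last and keeps equal elements in their original order.

```julia
# Hypothetical helper for illustration only; not the StatsBase implementation.
function naive_tiedrank(x::AbstractVector{<:Real})
    n = length(x)
    p = sortperm(x)              # indices that sort x (NaNs sort last)
    ranks = zeros(Float64, n)
    i = 1
    while i <= n
        j = i
        # Extend the run while the next sorted value compares == to the first.
        # NaN == NaN is false, so a run of NaNs never grows beyond one element.
        while j < n && x[p[j + 1]] == x[p[i]]
            j += 1
        end
        avg = (i + j) / 2        # average of the sorted positions i..j
        for k in i:j
            ranks[p[k]] = avg
        end
        i = j + 1
    end
    return ranks
end

naive_tiedrank([2.0, 2.0, 1.0])   # genuine ties share an average rank
naive_tiedrank([NaN, NaN, NaN])   # every NaN gets its own rank
```

Under these assumptions the all-NaN vector is ranked exactly like a strictly increasing one, which would explain the correlation of 1.0 against 1:6.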
Maybe it should throw an error or return NaN:
The simplest option would be to add the following to corspearman:
# Return NaN when either input consists entirely of NaNs.
if all(isnan, x) || all(isnan, y)
    return NaN
end
I'm not sure. It seems to me that we are actually kind of following the logic of NaNs here, but I agree that it would be unfortunate to draw conclusions based on NaNs made invisible by cor. @simonbyrne what do you say?
Related issue: https://github.com/JuliaStats/StatsBase.jl/issues/2
@andreasnoack solved it using any(isnan(... in https://github.com/JuliaStats/StatsBase.jl/commit/32b8504600bf48eb8e7d2bedd0cd6059ed18ac5c#diff-27950e5b50e5bee2cb2c80262640f76fR9
Why was that reversed at some point? Returning NaN like Base.cor sounds like the best option to me:
julia> Base.cor([1.,2.,3.,4.,5.,6.], [NaN, NaN, NaN, NaN, NaN, NaN])
NaN
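A rough sketch of what NaN propagation could look like for the rank-based case, in the spirit of the any(isnan(... check mentioned above. Both avg_ranks and spearman_nan are hypothetical names made up for this example, not StatsBase functions, and the ranking helper assumes ties are detected with ==.

```julia
using Statistics  # standard library; provides cor

# Hypothetical helper: average ranks, with ties detected via ==.
function avg_ranks(x::AbstractVector{<:Real})
    n = length(x)
    p = sortperm(x)
    r = zeros(Float64, n)
    i = 1
    while i <= n
        j = i
        while j < n && x[p[j + 1]] == x[p[i]]
            j += 1
        end
        r[p[i:j]] .= (i + j) / 2   # tied values share the average rank
        i = j + 1
    end
    return r
end

# Hypothetical wrapper: Spearman correlation that returns NaN
# as soon as either input contains a NaN.
function spearman_nan(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    length(x) == length(y) ||
        throw(DimensionMismatch("x and y must have the same length"))
    (any(isnan, x) || any(isnan, y)) && return NaN
    return cor(avg_ranks(x), avg_ranks(y))
end

spearman_nan([1., 2., 3., 4., 5., 6.], fill(NaN, 6))   # NaN instead of 1.0
spearman_nan([1., 2., 3.], [3., 2., 1.])               # perfectly anti-monotone
```

Checking with any rather than all also covers inputs where only some entries are NaN, which would otherwise be silently ranked last.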
Ha. Funny that I actually made that fix. I'd completely forgotten that. All those years. I can see I forgot to add a test for the NaN case, which is probably why it was reverted at some point.
I'm not sure if anyone in the IEEE 754 working group knew or cared about statistics, so maybe we shouldn't feel too restricted by their recommendations here, but @simonbyrne likes to explore the weird world of IEEE 754, so we should wait for his comment.
Related: https://github.com/JuliaLang/julia/issues/6486
Returning a NaN in the presence of NaNs is probably the best answer.
Or do the same as https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.nanrankdata for people coming from Python.