StatsBase.jl

corspearman when all the values are NaNs

Open diegozea opened this issue 9 years ago • 8 comments

It looks like an error in tiedrank:

julia> corspearman([1.,2.,3.,4.,5.,6.], [NaN, NaN, NaN, NaN, NaN, NaN])
1.0

julia> rank([NaN, NaN, NaN, NaN, NaN, NaN])
ERROR: MethodError: `rank` has no method matching rank(::Array{Float64,1})

julia> tiedrank([NaN, NaN, NaN, NaN, NaN, NaN])
6-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0
 5.0
 6.0

diegozea, May 23 '16 18:05
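
A minimal sketch of why the result can come out as 1.0, assuming a sort-based ranking. `naive_rank` is a hypothetical stand-in written for illustration, not the StatsBase implementation. When every value is NaN, sorting leaves the elements in their original positions, so the ranks are simply 1 through n, and those correlate perfectly with any strictly increasing x.

```julia
using Statistics  # standard library; provides cor

# Hypothetical stand-in for a sort-based ranking (not StatsBase's code):
# element i receives the position it takes in the sorted order.
function naive_rank(v::AbstractVector{<:Real})
    p = sortperm(v)            # permutation that sorts v
    r = similar(v, Float64)
    r[p] = 1:length(v)         # the k-th smallest element gets rank k
    return r
end

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = fill(NaN, 6)

ry = naive_rank(y)             # an all-NaN input keeps its original order
rho = cor(naive_rank(x), ry)   # Pearson correlation of the two rank vectors
```

Here `ry` matches the 1.0 to 6.0 output of tiedrank shown above, which is why the NaNs become invisible once the ranks are passed on to a correlation.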

At least we agree with R:

julia> R"rank(c(NaN,NaN,NaN,NaN,NaN,NaN))"
RCall.RObject{RCall.RealSxp}
[1] 1 2 3 4 5 6

andreasnoack, May 23 '16 19:05

Maybe it should throw an error or return NaN. The simplest option would be to add the following to corspearman:

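        # Return NaN when either input consists entirely of NaNs.
        # all(isnan, x) avoids calling isnan on a whole array,
        # which newer Julia versions reject.
        if all(isnan, x) || all(isnan, y)
            return NaN
        end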

diegozea, May 23 '16 20:05

I'm not sure. It seems to me that we are actually kind of following the logic of NaNs here, but I agree that it would be unfortunate to draw conclusions based on NaNs made invisible by cor. @simonbyrne what do you say?

andreasnoack, May 23 '16 23:05

Related issue: https://github.com/JuliaStats/StatsBase.jl/issues/2

@andreasnoack solved it using any(isnan(... in https://github.com/JuliaStats/StatsBase.jl/commit/32b8504600bf48eb8e7d2bedd0cd6059ed18ac5c#diff-27950e5b50e5bee2cb2c80262640f76fR9

Why was that reverted at some point? Returning NaN, as Base.cor does, sounds like the best option to me:

julia> Base.cor([1.,2.,3.,4.,5.,6.], [NaN, NaN, NaN, NaN, NaN, NaN])
NaN

diegozea, May 24 '16 00:05
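
For readers on current Julia releases: cor is now provided by the Statistics standard library rather than Base. The sketch below repeats the check above under that assumption. The variable names are illustrative only.

```julia
using Statistics  # standard library; cor moved here from Base

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = fill(NaN, 6)

# Pearson correlation propagates NaN: the mean of y is NaN, and so is
# everything computed from it. This should match the Base.cor output above.
pearson = cor(x, y)
println(isnan(pearson))
```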

Ha. Funny that I actually made that fix. I'd completely forgotten that. All those years. I can see I forgot to add a test for the NaN case, which is probably why it was reverted at some point.

I'm not sure if anyone in the IEEE 754 working group knew or cared about statistics, so maybe we shouldn't feel too restricted by their recommendations here. But @simonbyrne likes to explore the weird world of IEEE 754, so we should wait for his comment.

andreasnoack, May 24 '16 01:05

Related: https://github.com/JuliaLang/julia/issues/6486

simonster, May 24 '16 01:05

Returning NaN in the presence of NaNs is probably the best answer.

simonbyrne, May 24 '16 09:05
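
A rough sketch of what "return NaN in the presence of NaNs" could look like for a Spearman correlation. Both `average_rank` and `spearman_nan` are hypothetical names introduced here. `average_rank` is a simplified stand-in for a tied ranking, not the StatsBase code, and the guard is only one possible way to write the check.

```julia
using Statistics  # standard library; provides cor

# Average ranks: every member of a run of equal values gets the mean
# of the positions that run occupies in the sorted order.
function average_rank(v::AbstractVector{<:Real})
    p = sortperm(v)
    r = zeros(Float64, length(v))
    i = 1
    while i <= length(v)
        j = i
        # Extend j over the run of values equal to v[p[i]].
        while j < length(v) && v[p[j + 1]] == v[p[i]]
            j += 1
        end
        r[p[i:j]] .= (i + j) / 2
        i = j + 1
    end
    return r
end

# Hypothetical wrapper: propagate NaN instead of ranking it.
function spearman_nan(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    (any(isnan, x) || any(isnan, y)) && return NaN
    return cor(average_rank(x), average_rank(y))
end

println(spearman_nan([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], fill(NaN, 6)))
```

Checking with any rather than all means a single NaN is enough to make the result NaN, which mirrors how the Pearson cor behaves.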

Or do the same as nanrankdata in bottleneck (https://kwgoodman.github.io/bottleneck-doc/reference.html#bottleneck.nanrankdata), for people coming from Python.

xgdgsc, Sep 16 '23 10:09
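
A sketch of that alternative, under the assumption that the intended behavior is to rank only the non-NaN entries, with average ranks for ties, and to leave NaN in the positions that were NaN. `nan_rank` is a hypothetical name used for illustration. It is not a StatsBase function and not a port of the bottleneck code.

```julia
# Rank the non-NaN entries of v (average ranks for ties); NaN inputs stay NaN.
function nan_rank(v::AbstractVector{<:Real})
    r = fill(NaN, length(v))
    keep = findall(!isnan, v)        # indices of the non-NaN entries
    p = keep[sortperm(v[keep])]      # those indices, ordered by value
    i = 1
    while i <= length(p)
        j = i
        # Extend j over the run of values equal to v[p[i]].
        while j < length(p) && v[p[j + 1]] == v[p[i]]
            j += 1
        end
        r[p[i:j]] .= (i + j) / 2     # mean position for the whole tie
        i = j + 1
    end
    return r
end

ranks = nan_rank([3.0, NaN, 1.0, 3.0])
```

With this approach an all-NaN input yields an all-NaN rank vector, so any correlation computed from it is NaN rather than 1.0.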