StatsBase.jl
ECDF evaluated on NaN is 1.0
I was looking at some strange results in my code when I discovered this:
julia> ecdf(randn(100))(NaN)
1.0
https://github.com/JuliaStats/StatsBase.jl/blob/master/src/empirical.jl#L18
julia> searchsortedlast(randn(10), NaN) / 10
1.0
Some more ecdf fun:
julia> ecdf([1,2,NaN])(Inf)
0.6666666666666666
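Both results can be reproduced with Base Julia alone. The snippet below is only a sketch of the sort-then-search idea, assuming the ECDF is computed as `searchsortedlast` on the sorted sample divided by its length; the variable names `sample` and `with_nan` are made up for illustration and are not taken from StatsBase.

```julia
# Base Julia only; this mirrors the sort-then-search idea, not StatsBase's exact code.
sample = sort(randn(10))

# isless defines a total order in which NaN sorts after every other float,
# including Inf.
isless(sample[end], NaN)   # true
isless(Inf, NaN)           # true

# A NaN query therefore lands past the last element of the sorted sample.
searchsortedlast(sample, NaN) / length(sample)       # 1.0

# A NaN in the sample is sorted to the end, so an Inf query only passes
# the two finite values.
with_nan = sort([1, 2, NaN])                         # [1.0, 2.0, NaN]
searchsortedlast(with_nan, Inf) / length(with_nan)   # 0.6666666666666666
```

So neither result comes from special-casing NaN; both fall out of the ordering that `isless` imposes.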
What would be the general consensus on the following behavior:
ecdf(x) => ecdf(filter(!isnan, x))
ecdf(x)(NaN) => NaN
or something to that effect? I think it would be less surprising...
I would not be opposed to ecdf(x)(Inf) => 1, and ecdf(x)(-Inf) => 0, but I feel less strongly about these.
The general convention in Julia is that NaNs are never silently ignored: they either propagate or throw an error. So ecdf(x) should probably just throw an error when x contains NaN, or create a function which always returns NaN. ecdf(x)(NaN) should return NaN too.
ecdf(x)(Inf) == 1 and ecdf(x)(-Inf) == 0 kind of make sense to me.
+1 from me for:
- ecdf(x) throwing an error if any(isnan, x)
- ecdf(x)(Inf) == 1
- ecdf(x)(-Inf) == 0
- isnan(ecdf(x)(NaN))
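For concreteness, here is a minimal sketch of the behavior listed above, written against Base Julia only. `strict_ecdf` is a hypothetical name and the error message is made up; this is not the StatsBase implementation, just one way the four properties could fit together.

```julia
# Hypothetical helper, not part of StatsBase: a sketch of the proposed behavior.
function strict_ecdf(x::AbstractVector{<:Real})
    # Refuse to build an ECDF from data containing NaN.
    any(isnan, x) && throw(ArgumentError("sample must not contain NaN"))
    sorted = sort(x)
    n = length(sorted)
    # A NaN query propagates; any other query returns the fraction of points <= v.
    return v -> isnan(v) ? NaN : searchsortedlast(sorted, v) / n
end

F = strict_ecdf([1.0, 2.0, 3.0])
F(Inf)          # 1.0
F(-Inf)         # 0.0
isnan(F(NaN))   # true
```

Calling `strict_ecdf([1.0, NaN])` throws, so users who want NaNs dropped would have to apply `filter(!isnan, x)` themselves first.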
I'll try to make a PR today.
@joshday I think that makes sense -- users can always filter NaNs out of whatever they give to ecdf beforehand.
Different issue here but along the same lines. How would folks expect Inf to work with ecdf? Current behavior (matches R) is:
julia> ecdf([1,2,Inf])(Inf)  # Works because Inf == Inf
1.0
julia> ecdf([1,2,-Inf])(-Inf)
0.3333333333333333
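Both values are what the plain definition of an ECDF gives, namely the fraction of sample points less than or equal to the query. A short check in Base Julia (the names `with_pos_inf` and `with_neg_inf` are illustrative only):

```julia
with_pos_inf = [1.0, 2.0, Inf]
with_neg_inf = [1.0, 2.0, -Inf]

# All three points are <= Inf, since Inf <= Inf holds.
count(v -> v <= Inf, with_pos_inf) / length(with_pos_inf)    # 1.0

# Only the -Inf point itself is <= -Inf, so the result is 1/3 rather than 0.
count(v -> v <= -Inf, with_neg_inf) / length(with_neg_inf)   # 0.3333333333333333
```

This is why ecdf(x)(-Inf) == 0 can only hold in general if -Inf is kept out of the sample.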
A case could be made to also disallow Inf/-Inf in creating ECDFs.