StatsBase.jl ECDF evaluated on NaN is 1.0


I was looking at some strange results in my code when I discovered this:

julia> ecdf(randn(100))(NaN)
1.0

Sep 14 '18 13:09 joshday

https://github.com/JuliaStats/StatsBase.jl/blob/master/src/empirical.jl#L18

julia> searchsortedlast(randn(10), NaN) / 10
1.0
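
A small sketch of why this happens, using only Base (the vector `v` is made up for illustration). `searchsortedlast` compares with `isless`, and `isless` orders NaN after every other floating-point value, so a NaN query is treated as greater than or equal to every element:

```julia
# Base only; the data in `v` is an arbitrary illustrative sample.
v = sort([0.3, -1.2, 2.5, 0.0])

# Under isless, NaN is ordered after every other float, including Inf.
inf_before_nan = isless(Inf, NaN)   # true
nan_before_inf = isless(NaN, Inf)   # false

# searchsortedlast returns the last index whose value is <= the query
# (in the isless ordering), so a NaN query lands on the final index.
idx = searchsortedlast(v, NaN)
ratio = idx / length(v)             # 1.0
```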

Sep 14 '18 22:09 ararslan

Some more ecdf fun:

julia> ecdf([1,2,NaN])(Inf)
0.6666666666666666
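
This one can be reproduced with Base alone, assuming the ECDF sorts its input and then calls `searchsortedlast` as in the line linked above. The NaN sorts to the end of the data, past `Inf`, so it is never counted as being less than or equal to the query:

```julia
# Base only; mimics the sort + searchsortedlast steps, not the StatsBase code itself.
sorted = sort([1, 2, NaN])                  # NaN is placed last: [1.0, 2.0, NaN]

# Inf is ordered before NaN by isless, so only the two finite values count.
count_le = searchsortedlast(sorted, Inf)    # 2
ratio = count_le / length(sorted)           # 2/3
```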

Sep 15 '18 20:09 joshday

What would be the general consensus on the following behavior:

ecdf(x) => ecdf(filter(!isnan, x))
ecdf(x)(NaN) => NaN

or something to that effect? I think it would be less surprising...

I would not be opposed to ecdf(x)(Inf) => 1, and ecdf(x)(-Inf) => 0, but I feel less strongly about these.

Nov 14 '19 18:11 tpoisot

The general convention in Julia is that NaN values are never silently ignored: they either propagate or throw an error. So ecdf(x) should probably just throw an error, or create a function that always returns NaN. ecdf(x)(NaN) should return NaN too.

ecdf(x)(Inf) == 1 and ecdf(x)(-Inf) == 0 kind of make sense to me.

Nov 15 '19 13:11 nalimilan

+1 from me for:

ecdf(x) throwing an error if any(isnan, x)
ecdf(x)(Inf) == 1
ecdf(x)(-Inf) == 0
isnan(ecdf(x)(NaN))

I'll try to make a PR today.
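
To make the four points above concrete, here is a minimal sketch of those semantics. The name `strict_ecdf` is hypothetical and this is not the StatsBase implementation or the eventual PR; it only assumes the sort-then-`searchsortedlast` approach from the linked source line:

```julia
# Hypothetical helper illustrating the proposed behavior; not StatsBase code.
function strict_ecdf(x::AbstractVector{<:Real})
    # Refuse to build an ECDF from data that contains NaN.
    any(isnan, x) && throw(ArgumentError("data must not contain NaN"))
    sorted = sort(x)
    n = length(sorted)
    # A NaN query propagates; otherwise return the fraction of samples <= t.
    return t -> isnan(t) ? NaN : searchsortedlast(sorted, t) / n
end

F = strict_ecdf([1.0, 2.0, 3.0])
F(Inf)    # 1.0
F(-Inf)   # 0.0
F(NaN)    # NaN
```

With finite data, the values at `Inf` and `-Inf` fall out of the counting rule without special-casing; only the two NaN checks are new.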

Nov 15 '19 14:11 joshday

@joshday I think that makes sense -- users can always filter whatever they give to ecdf to remove NaNs beforehand.
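
A sketch of that user-side filtering, again with Base only (the sample `x` is illustrative). With StatsBase loaded, `clean` would simply be passed to `ecdf`; here the same fraction is computed by hand:

```julia
x = [1.0, 2.0, NaN]

# Drop NaN values before building the ECDF.
clean = filter(!isnan, x)                                   # [1.0, 2.0]

# Fraction of the remaining samples that are <= Inf.
ratio = searchsortedlast(sort(clean), Inf) / length(clean)  # 1.0
```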

Nov 15 '19 14:11 tpoisot

Different issue here but along the same lines. How would folks expect Inf to work with ecdf? Current behavior (matches R) is:

julia> ecdf([1,2,Inf])(Inf)  # Works because Inf == Inf
1.0

julia> ecdf([1,2,-Inf])(-Inf)
0.3333333333333333

A case could be made to also disallow Inf/-Inf in creating ECDFs.

Nov 25 '19 14:11 joshday
