CategoricalArrays.jl icon indicating copy to clipboard operation
CategoricalArrays.jl copied to clipboard

cut combined with labels and extend throws ArgumentError

Open laborg opened this issue 7 years ago • 9 comments
trafficstars

Considering the thrown ArgumentError the behaviour of cut is unclear:

julia> CategoricalArrays.cut([1,3], [2]; extend=true, labels=["a","b"])
2-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"

julia> CategoricalArrays.cut([2,3], [2]; extend=true, labels=["a","b"])
ERROR: ArgumentError: labels must be of length 1, but got length 2
Stacktrace:
 [1] #cut#68(::Bool, ::Array{String,1}, ::Bool, ::Function, ::Array{Int64,1}, ::Array{Int64,1}) at /home/gerhard/.julia/packages/CategoricalArrays/04bks/src/extras.jl:113
 [2] (::getfield(CategoricalArrays, Symbol("#kw##cut")))(::NamedTuple{(:extend, :labels),Tuple{Bool,Array{String,1}}}, ::typeof(cut), ::Array{Int64,1}, ::Array{Int64,1}) at ./none:0
 [3] top-level scope at none:0

Is this intended? I would have assumed, that with extend==true the first and the last label would be taken for values outside the borders.

Empty intervals shouldn't be a reason for an error, otherwise this would also need to fail:

CategoricalArrays.cut([0,10],[3,7];extend=true, labels=["a","b","c"])
2-element CategoricalArray{String,1,UInt32}:
 "a"
 "c"

laborg avatar Sep 25 '18 13:09 laborg

Is this intended? I would have assumed, that with extend==true the first and the last label would be taken for values outside the borders.

The way extend currently works is that it adds the minimum/maximum value to the breaks if it's outside of them. When you provide a single cut point and there's a single value outside of the breaks, a second break is added, and you get a single interval. So cut([2,3], [2]; extend=true) is equivalent to cut([2,3], [2,3]), and there's a single interval.

What output would you expect in that case? What do other software do?

nalimilan avatar Sep 25 '18 15:09 nalimilan

My problem is with the number of labels depending on the value of breaks in relation to x.

I would have expected following (note the two element labels vector):

cut([2,3], [6]; extend=true, labels=["a","b"])
"a"
"a"

cut([2,3], [0]; extend=true, labels=["a","b"])
"b"
"b"

But instead it tells me that only one label is expected:

cut([2,3], [6]; extend=true, labels=["a","b"])
ERROR: ArgumentError: labels must be of length 1, but got length 2
Stacktrace:
 [1] #cut#68(::Bool, ::Array{String,1}, ::Bool, ::Function, ::Array{Int64,1}, ::Array{Int64,1}) at /home/gerhard/.julia/packages/CategoricalArrays/04bks/src/extras.jl:113
 [2] (::getfield(CategoricalArrays, Symbol("#kw##cut")))(::NamedTuple{(:extend, :labels),Tuple{Bool,Array{String,1}}}, ::typeof(cut), ::Array{Int64,1}, ::Array{Int64,1}) at ./none:0
 [3] top-level scope at none:0

While this works well:

cut([2,3], [2.5]; extend=true, labels=["a","b"])
2-element CategoricalArray{String,1,UInt32}:
 "a"
 "b"

laborg avatar Sep 25 '18 16:09 laborg

OK, so you would expect to always get an empty class. But what labels should we use for these classes when you don't provide them? Or should we only do that when labels are provided explicitly?

Can you check what other software do in that case? I guess your expectations come from a previous experience?

nalimilan avatar Sep 25 '18 16:09 nalimilan

My experience for cut comes from R, but R doesn't offer an extend option so a comparison wouldn't be helpful.

I'll still think I'm misunderstood: If you cut something at a certain point, and define everything left to the cut point as "a" and everything right of the cut point as "b", it shouldn't throw an error if the thing you want to cut is completely on the left (or right) of the cut point, as you might not control the input.

More drastically, the call cut(rand(1:10,2),[5]; extend=true, labels=["a","b"]) throws an error in roughly 50% of the cases, while this cut(rand(1:10,2),[5]; extend=true) never fails...

In the mean time I stopped relying on the combination of extend=true and labels and fallen back to cut(x,[-Inf, cutpoint, Inf]; extend=false, labels=["a","b"]) to achieve what I want.

laborg avatar Sep 25 '18 17:09 laborg

The point of extend is (as the docs say) to add new breaks to include all values. That's useful especially if you want those extreme values to automatically be used in the class labels. If what you want is to include intervals from -Inf to +Inf better do that explicitly.

nalimilan avatar Sep 27 '18 15:09 nalimilan

Okay I'll do that, but can you give me any example with extend=true and labels=[xxx] set that doesn't fail on a certain input?

laborg avatar Sep 27 '18 15:09 laborg

There aren't. That's kind of the point of extend.

nalimilan avatar Sep 27 '18 15:09 nalimilan

Update: I just realized that the current breaks/label behaviour is kind of ambiguous and doesn't fit with my proposal.

I don't want to take more time from you, but I think this proposal would make cut more useful: (condition is n umber of l abels in relation to n umber of b reaks)

Current behaviour

condition labels breaks no extend extend left extend right extend both
nl==nb+1 a, b 3 - error error a..3..b
nl==nb a, b 3, 5 error a..3..b..5 3..a..5..b error
nl==nb-1 a, b 3, 5, 7 3..a..5..b..7 error error error
nl==nb+1 a, b, c 3, 5 error error error a..3..b..5..c

Proposed behaviour

condition labels breaks no extend extend left extend right extend both
nl==nb+1 a, b 3 - a..3 3..b a..3..b
nl==nb a, b 3, 5 error error error error
nl==nb-1 a, b 3, 5, 7 3..a..5..b..7 error error error
nl==nb+1 a, b, c 3, 5 3..b..5 a..3..b..5 3..b..5..c a..3..b..5..c
...

laborg avatar Sep 27 '18 17:09 laborg

I think that's too clever and too likely to confuse users. Better add another argument which would add -Inf and +Inf to the breaks if you find it useful.

nalimilan avatar Sep 27 '18 20:09 nalimilan