CategoricalArrays.jl
CategoricalArrays.jl copied to clipboard
cut combined with labels and extend throws ArgumentError
Considering the thrown ArgumentError the behaviour of cut is unclear:
julia> CategoricalArrays.cut([1,3], [2]; extend=true, labels=["a","b"])
2-element CategoricalArray{String,1,UInt32}:
"a"
"b"
julia> CategoricalArrays.cut([2,3], [2]; extend=true, labels=["a","b"])
ERROR: ArgumentError: labels must be of length 1, but got length 2
Stacktrace:
[1] #cut#68(::Bool, ::Array{String,1}, ::Bool, ::Function, ::Array{Int64,1}, ::Array{Int64,1}) at /home/gerhard/.julia/packages/CategoricalArrays/04bks/src/extras.jl:113
[2] (::getfield(CategoricalArrays, Symbol("#kw##cut")))(::NamedTuple{(:extend, :labels),Tuple{Bool,Array{String,1}}}, ::typeof(cut), ::Array{Int64,1}, ::Array{Int64,1}) at ./none:0
[3] top-level scope at none:0
Is this intended? I would have assumed, that with extend==true the first and the last label would be taken for values outside the borders.
Empty intervals shouldn't be a reason for an error, otherwise this would also need to fail:
CategoricalArrays.cut([0,10],[3,7];extend=true, labels=["a","b","c"])
2-element CategoricalArray{String,1,UInt32}:
"a"
"c"
Is this intended? I would have assumed, that with
extend==truethe first and the last label would be taken for values outside the borders.
The way extend currently works is that it adds the minimum/maximum value to the breaks if it's outside of them. When you provide a single cut point and there's a single value outside of the breaks, a second break is added, and you get a single interval. So cut([2,3], [2]; extend=true) is equivalent to cut([2,3], [2,3]), and there's a single interval.
What output would you expect in that case? What do other software do?
My problem is with the number of labels depending on the value of breaks in relation to x.
I would have expected following (note the two element labels vector):
cut([2,3], [6]; extend=true, labels=["a","b"])
"a"
"a"
cut([2,3], [0]; extend=true, labels=["a","b"])
"b"
"b"
But instead it tells me that only one label is expected:
cut([2,3], [6]; extend=true, labels=["a","b"])
ERROR: ArgumentError: labels must be of length 1, but got length 2
Stacktrace:
[1] #cut#68(::Bool, ::Array{String,1}, ::Bool, ::Function, ::Array{Int64,1}, ::Array{Int64,1}) at /home/gerhard/.julia/packages/CategoricalArrays/04bks/src/extras.jl:113
[2] (::getfield(CategoricalArrays, Symbol("#kw##cut")))(::NamedTuple{(:extend, :labels),Tuple{Bool,Array{String,1}}}, ::typeof(cut), ::Array{Int64,1}, ::Array{Int64,1}) at ./none:0
[3] top-level scope at none:0
While this works well:
cut([2,3], [2.5]; extend=true, labels=["a","b"])
2-element CategoricalArray{String,1,UInt32}:
"a"
"b"
OK, so you would expect to always get an empty class. But what labels should we use for these classes when you don't provide them? Or should we only do that when labels are provided explicitly?
Can you check what other software do in that case? I guess your expectations come from a previous experience?
My experience for cut comes from R, but R doesn't offer an extend option so a comparison wouldn't be helpful.
I'll still think I'm misunderstood: If you cut something at a certain point, and define everything left to the cut point as "a" and everything right of the cut point as "b", it shouldn't throw an error if the thing you want to cut is completely on the left (or right) of the cut point, as you might not control the input.
More drastically, the call cut(rand(1:10,2),[5]; extend=true, labels=["a","b"]) throws an error in roughly 50% of the cases, while this cut(rand(1:10,2),[5]; extend=true) never fails...
In the mean time I stopped relying on the combination of extend=true and labels and fallen back to cut(x,[-Inf, cutpoint, Inf]; extend=false, labels=["a","b"]) to achieve what I want.
The point of extend is (as the docs say) to add new breaks to include all values. That's useful especially if you want those extreme values to automatically be used in the class labels. If what you want is to include intervals from -Inf to +Inf better do that explicitly.
Okay I'll do that, but can you give me any example with extend=true and labels=[xxx] set that doesn't fail on a certain input?
There aren't. That's kind of the point of extend.
Update: I just realized that the current breaks/label behaviour is kind of ambiguous and doesn't fit with my proposal.
I don't want to take more time from you, but I think this proposal would make cut more useful:
(condition is n umber of l abels in relation to n umber of b reaks)
Current behaviour
| condition | labels | breaks | no extend | extend left | extend right | extend both |
|---|---|---|---|---|---|---|
| nl==nb+1 | a, b | 3 | - | error | error | a..3..b |
| nl==nb | a, b | 3, 5 | error | a..3..b..5 | 3..a..5..b | error |
| nl==nb-1 | a, b | 3, 5, 7 | 3..a..5..b..7 | error | error | error |
| nl==nb+1 | a, b, c | 3, 5 | error | error | error | a..3..b..5..c |
Proposed behaviour
| condition | labels | breaks | no extend | extend left | extend right | extend both |
|---|---|---|---|---|---|---|
| nl==nb+1 | a, b | 3 | - | a..3 | 3..b | a..3..b |
| nl==nb | a, b | 3, 5 | error | error | error | error |
| nl==nb-1 | a, b | 3, 5, 7 | 3..a..5..b..7 | error | error | error |
| nl==nb+1 | a, b, c | 3, 5 | 3..b..5 | a..3..b..5 | 3..b..5..c | a..3..b..5..c |
| ... |
I think that's too clever and too likely to confuse users. Better add another argument which would add -Inf and +Inf to the breaks if you find it useful.