CategoricalArrays.jl
`categorical` with levels and recoding at once
I looked through the issues but didn't see something comparable, excuse me if I missed something and duplicate old discussions.
Whenever I work with categorical data, it's usually something simple like "male"/"female", but the original dataset often codes it with placeholders such as 1 and 2 or 'm' and 'f'. So if I want a categorical array with "male" and "female", I have to take two steps: create the array, then recode it. I feel it would be more straightforward to allow recoding when the array is created, which could also be faster if there's a lot of data. I'm thinking about an API with a vector of pairs like this:
arr = [1, 2, 2, 1, 2, 1]
cat = categorical(arr, levels = [2 => "female", 1 => "male"])
As you can see, this lets me set the categorical values I want and, at the same time, choose a level order that differs from the natural 1, 2 sequence.
I think usually one would need to do something like this:
cat = recode(categorical(arr, levels = [2, 1]), 1 => "male", 2 => "female")
This gets more cumbersome the more levels there are and two full arrays need to be created.
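For illustration, here is a minimal sketch of how the proposed behavior could be emulated today. The helper name `categorical_from_pairs` is hypothetical and not part of CategoricalArrays.jl; the sketch only assumes the `levels` keyword of `categorical` already used above. It still builds an intermediate array of labels, so it shows the intended result rather than the single-pass construction being requested.

```julia
using CategoricalArrays

# Hypothetical helper (not part of CategoricalArrays.jl): recode the raw values
# and fix the level order in one call, following the order of the pairs.
function categorical_from_pairs(arr::AbstractArray, mapping::AbstractVector{<:Pair})
    lookup = Dict(mapping)                   # raw value => label
    labels = [lookup[x] for x in arr]        # intermediate array of labels
    return categorical(labels; levels = last.(mapping))
end

arr = [1, 2, 2, 1, 2, 1]
sex = categorical_from_pairs(arr, [2 => "female", 1 => "male"])
levels(sex)   # levels follow the order of the pairs: "female", then "male"
```

A value in `arr` with no matching pair raises a `KeyError` here; a real implementation would have to decide how to treat such values.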
I've been looking for the same functionality. You have two cases in mind. The one where arr ⊆ [1,2] works like this.
CategoricalArray{String,1}(
    arr,
    CategoricalPool(Dict("female" => 2, "male" => 1))
)
(I think it's undocumented though)
@nalimilan, shouldn't it be possible to construct a CategoricalArray from a refarray and leveldict? Is there a specific reason this doesn't exist? Would you mind a PR making categorical(refarray, leveldict) possible? E.g.
function categorical(refarray::AbstractArray{R, N},
                     invleveldict::Dict{V,R},
                     ordered=false
                     ) where {N, V, R <: Integer}
    CategoricalArray{V,N}(refarray, CategoricalPool(invleveldict, ordered))
end
Probably one could also allow leveldict::Dict{R,V} and !(R <: Integer) (which is the other case, arr ⊆ ['m','f'], that @jkrumbiegel mentioned).
Yeah this definitely makes sense. I haven't implemented these yet because I concentrated on getting the basics right, without working too much on convenience. But feel free to make a PR.
There are a few subtle issues to address though:
- Do we want to just wrap the input vector or to make a copy? I'd tend to avoid a copy, given that array constructors tend to be wrappers. That said, `CategoricalArray` could avoid copying while `categorical` makes a copy (possibly with an argument to choose the best behavior).
- If we don't make a copy, we are forced to use the input vector's type as the reference type. This may not be what is intended in general, as often one has a `Vector{Int}` input (as in @jkrumbiegel's example), but `UInt32` takes half the memory, is faster to process (e.g. for grouping) and reduces the amount of recompilation of functions (since it's the default type).
- When adding different constructors, we must ensure no ambiguity can happen (now or later). As @greimel noted, they could take either `invleveldict` or `leveldict` as the second argument. Yet I don't think it's possible to distinguish these in dispatch since `Dict{Int, Int}` could be both. One solution would be to pass these as keyword arguments to distinguish them, though that wouldn't allow inferring the return type. We could also decide that the two-argument constructors would always take the refs as the first argument, so it would make more sense that the second argument would either be a vector of levels or `leveldict`.
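The dispatch problem in the last point can be sketched in plain Julia. The function `decode_refs` below is hypothetical and exists only for this illustration; it is not a CategoricalArrays.jl API. It returns a plain vector of decoded levels, so it shows only the keyword-based disambiguation, not the construction of a categorical array.

```julia
# Hypothetical sketch, not part of CategoricalArrays.jl.
# `leveldict` maps reference code => level; `invleveldict` maps level => reference code.
function decode_refs(refs::AbstractArray{<:Integer};
                     leveldict::Union{AbstractDict, Nothing} = nothing,
                     invleveldict::Union{AbstractDict, Nothing} = nothing)
    if (leveldict === nothing) == (invleveldict === nothing)
        throw(ArgumentError("pass exactly one of `leveldict` or `invleveldict`"))
    end
    if leveldict === nothing
        # Invert level => code into code => level.
        code_to_level = Dict(code => level for (level, code) in invleveldict)
    else
        code_to_level = leveldict
    end
    return [code_to_level[r] for r in refs]
end

# A Dict{Int, Int} carries no hint about its direction, so a positional
# argument could not be dispatched on; the keyword name decides instead.
mapping = Dict(1 => 10, 2 => 20)

as_leveldict    = decode_refs([1, 2, 2];    leveldict = mapping)     # codes 1, 2 stand for levels 10, 20
as_invleveldict = decode_refs([10, 20, 20]; invleveldict = mapping)  # levels 1, 2 are stored as codes 10, 20
```

The same `Dict{Int, Int}` yields different results depending on which keyword carries it, which is exactly the information a positional signature cannot express.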