CategoricalArrays.jl icon indicating copy to clipboard operation
CategoricalArrays.jl copied to clipboard

Add `Vector` conversion

Open pdeffebach opened this issue 5 years ago • 13 comments

Currently there is an inconsistency between Array and Vector conversions for categorical arrays

julia> x = CategoricalArray([1, 1, 1, 2, 2, 3]);

julia> Array(x)
6-element Array{Int64,1}:
 1
 1
 1
 2
 2
 3

julia> Vector(x)
6-element Array{CategoricalValue{Int64,UInt32},1}:
 1
 1
 1
 2
 2
 3

pdeffebach avatar Aug 18 '20 16:08 pdeffebach

As mentioned on slack, a better solution overall would be to make Vector return a CategoricalArray and have an explicit uncategorize method.

pdeffebach avatar Aug 18 '20 16:08 pdeffebach

Vector may not return CategoricalArray, as Vector has a very concrete meaning in Base (it should call a constructor of Vector that should allocate a fresh Vector as opposed to e.g. convert functions).

bkamins avatar Aug 18 '20 16:08 bkamins

Good catch. These indeed have to return an Array, but we should decide whether to return an Array{Int} or an Array {<:CategoricalValue{Int}}. This is related to whether we keep the current similar and collect overloads, which ensure Array{<:CategoricalValue} is never produced: if we drop them, we could return Array{<:CategoricalValue{Int}}, which is somewhat more logical than Array{Int}.

EDIT: though reading the Slack thread it seems that Array{Int} is what was expected (in that case at least)

nalimilan avatar Aug 19 '20 16:08 nalimilan

It was expected for consistency with Array. However, I think that in these cases we could allow some flexibility and change the behaviour if it helps in other areas.

bkamins avatar Aug 19 '20 17:08 bkamins

if we drop them, we could return Array{<:CategoricalValue{Int}}, which is somewhat more logical than Array{Int}.

I think most people converting to a Array are just saying "I want out! give me numbers again!". I think Array{<:CategoricalValue{Int}} is fine as long as we give people an explicit decategorize command.

pdeffebach avatar Aug 19 '20 20:08 pdeffebach

broadcasted get now almost does it (except that it fails on missing)

bkamins avatar Aug 20 '20 05:08 bkamins

Do we have a plan for get to work with missing?

I just had a frustrating experience with factors in R and thought Julia would be nicer, but this is still annoying.

pdeffebach avatar Feb 12 '21 19:02 pdeffebach

passmissing(get)?

nalimilan avatar Feb 12 '21 19:02 nalimilan

Ah that does work. Fair enough!

I don't really understand the choice of get to be honest, since a categorical value isn't a collection. If we made our own function we could define it for missing. But I appreciate the need to get to 1.0 and it's not that big a deal.

pdeffebach avatar Feb 12 '21 20:02 pdeffebach

Actually get is also my long-standing gripe https://github.com/JuliaData/CategoricalArrays.jl/issues/142. I would change it if @nalimilan supported this (and then passmissing would not be needed).

bkamins avatar Feb 12 '21 22:02 bkamins

AFAICT we agreed on using unwrap at #142. We just need to move the definition from Tables.jl to DataAPI.jl.

nalimilan avatar Feb 13 '21 18:02 nalimilan

@quinnj - would you have time to make this move? Then we can update CategoricalArrays.jl

bkamins avatar Feb 13 '21 18:02 bkamins

PR up: https://github.com/JuliaData/DataAPI.jl/pull/35. So I realized we don't really need to move anything from Tables.jl; the definitions there are...not really related and not really necessary. Like, they're not useful generically. So we just define the single unwrap(x) = x definition in DataAPI.jl that CategoricalArrays can overload.

quinnj avatar Feb 14 '21 00:02 quinnj