Query.jl icon indicating copy to clipboard operation
Query.jl copied to clipboard

`@where` does not preserve CategoricalArray

Open alyst opened this issue 8 years ago • 7 comments

Julia v0.6, DataFrames, CategoricalArray and Query masters

julia> using DataFrames, Query

julia> df = DataFrame(a = categorical(["a", "b", "a", "c"]))
4×1 DataFrames.DataFrame
│ Row │ a │
├─────┼───┤
│ 1   │ a │
│ 2   │ b │
│ 3   │ a │
│ 4   │ c │

julia> typeof(df[:a])
CategoricalArrays.CategoricalArray{String,1,UInt32,String,Union{}}

julia> dff = df |> @where(_.a != "b") |> DataFrame
3×1 DataFrames.DataFrame
│ Row │ a │
├─────┼───┤
│ 1   │ a │
│ 2   │ a │
│ 3   │ c │

julia> typeof(dff[:a])
Array{CategoricalArrays.CategoricalValue{String,UInt32},1}

IIUC Array{CategoricalValue} is suboptimal in terms of storage. Also CategoricalArrays API like levels(dff[:a]) is not available.

alyst avatar Oct 11 '17 21:10 alyst

Indeed Array{CategoricalValue} isn't supposed to be used. It's kind of annoying that we don't have a way to specify that a given element type should always be stored in a given container type.

nalimilan avatar Oct 12 '17 19:10 nalimilan

I have some minimal solution to this, by it still involves moving+extending Enumerable abstract type definitions from QueryOperators.jl to TableTraits.jl and using them from IterableTables.jl. So it's pretty invasive. I can open PRs, but I suppose the core JuliaData contributors already have some strategy how to fix this and the other design problems (e.g. IterableTables.jl being dependent on every data-related package).

alyst avatar Oct 12 '17 23:10 alyst

The proper way to implement this would be to make sure that at collection into a DataFrame any column of type CategoricalValue uses the proper container type. Essentially just another elseif case here. Having said that, I think anything like that would require a bit more thinking about larger implications. I was a couple of times close to just adding this, but in the end I'm just not yet sure whether I want this in the iterable tables interface or not (main tradeoff is complexity, I'm trying to keep iterable tables as simple as possible, and I'm not sure this would still fall into that category).

davidanthoff avatar Oct 24 '17 22:10 davidanthoff

It's kind of annoying that we don't have a way to specify that a given element type should always be stored in a given container type.

YES, I completely agree. I think it would be absolutely fantastic if there was a standard way in base to say "for element type T, the default array type should be X", and if things like [a,b,c] (where a, b and c are instances of T) constructed an X, and and fill etc. all used that. I think a simple function like getdefaultarraytype(::Type{T}) where T = Array{T} in base would probably do it, where custom Ts could add a method and the various base array creation methods would rely on it.

I suggested something like that in https://github.com/JuliaLang/julia/issues/22630, but probably doomed all chances of success of that proposal by embedding it in a crazy rename idea ;) I think a much less ambitious proposal along the lines of the above paragraph would still be a great idea.

davidanthoff avatar Oct 24 '17 22:10 davidanthoff

Maybe we should just override similar(::Array, ::Type{<:CategoricalValue})?

nalimilan avatar Oct 25 '17 08:10 nalimilan

Where would that be picked up?

I guess I was so far mostly worrying about [a,b,c] when a, b and c are of type CategoricalValue (or in my case DataValue) and how one can intercept that. I think we can already intercept CategoricalValue[a,b,c], and this line seems key to intercepting all cases.

I kind of feel that if someone is asking for an Array specifically, he/she should get that specific type, right? Not sure...

davidanthoff avatar Nov 16 '17 01:11 davidanthoff

I kind of feel that if someone is asking for an Array specifically, he/she should get that specific type, right? Not sure...

In this case, the caller wouldn't require an Array, just an object "similar" to the input Array, so I'd say it wouldn't be absurd to return a CategoricalArray if the element type is a categorical value.

nalimilan avatar Nov 16 '17 10:11 nalimilan