Query.jl
Query.jl copied to clipboard
@groupby failed on CategoricalString with missing value
Let's say, we try to @groupby
on a CategoricalString
column
df = DataFrame(
fruit = ["Apple", "Banana", "Cherry", "Apple", "Banana", "Cherry"],
price = [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])
categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})
The result is exactly as expected.
3×2 DataFrame
│ Row │ key │ m │
│ │ Categorical… │ Float64 │
├─────┼──────────────┼─────────┤
│ 1 │ Apple │ 1.2 │
│ 2 │ Banana │ 2.0 │
│ 3 │ Cherry │ 0.4 │
Next, we make the same column contain some missing values.
df = DataFrame(
fruit = ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
price = [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])
categorical!(df, :fruit)
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})
Then, if failed with an error message stating somewhat like this.
ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous
However, this behavior is weird since String
column with similar missing values does not issue any complaints.
df = DataFrame(
fruit = ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
price = [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])
df |> @groupby(_.fruit) |> @map({key=key(_), m=mean(_.price)})
key │ m
─────────┼────
"Apple" │ 1.2
#NA │ 2.0
"Cherry" │ 0.4
"Banana" │ 2.0
This limitation may have to be fixed, since it is natural that String
columns are recoded as Categorical
columns before beginning any analysis.
Thanks.
Similar behavior found on @filter
missing values.
df = DataFrame(
fruit = ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"],
price = [1.2, 2.0, 0.4, 1.2, 2.0, 0.4])
categorical!(df, :fruit)
df |> @filter(!isna(_.fruit)) |> DataFrame
It fails with the same ERROR message:
LoadError: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous
This example works without any error if String
colums were not converted into Categorical
columns or if there were not any missing values.
The same kind of error persists with simpler types of filtering.
df |> @filter(_.fruit == "Banana") |> DataFrame
Also fails with the same ERROR message.
Conclusively, @filter
or @groupby
seems not work with Categorical
variables with missing value.
Thanks.
Same for
df |> @select(:price)
or
df |> @filter(true)
(with the above DataFrame
).
So it seems that the problem occurs when transforming the DataFrame
to a Query.jl
iterator.
This is actually unrelated to Query.jl, one can get the same error with:
using IteratorInterfaceExtensions
it = getiterator(df)
first(it)
The implementation for that lives in Tables.jl.
The root problem though seems to be that both DataValues.jl and CategoricalArrays.jl add methods to convert
that don't play well together.
I'm a bit lost what to do here, maybe @quinnj has an idea? Also CCing @nalimilan, the resident categorical expert. Is this a case that needs special casing in Tables.jl?
I haven't been able to reproduce locally (even after updating all packages). Is there anything special I need to do? I have DataValues 0.4.12.
Here is what I'm using:
(repo1) pkg> st
Status `C:\Users\david\.julia\environments\repo1\Project.toml`
[a93c6f00] DataFrames v0.19.2
[e7dc6d0d] DataValues v0.4.12
[82899510] IteratorInterfaceExtensions v1.0.0
And the code:
julia> using DataFrames, IteratorInterfaceExtensions, DataValues
julia> df = DataFrame(fruit= ["Apple", missing, "Cherry", "Apple", "Banana", "Cherry"], price= [1.2, 2.0, 0.4, 1.2, 2.0, 0.4]);
julia> categorical!(df, :fruit);
julia> it = getiterator(df);
julia> first(it)
ERROR: MethodError: convert(::Type{DataValue{CategoricalString{UInt32}}}, ::CategoricalString{UInt32}) is ambiguous. Candidates:
convert(::Type{S}, x::T) where {S, T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)} in CategoricalArrays at C:\Users\david\.julia\packages\CategoricalArrays\xjesC\src\value.jl:91
convert(::Type{DataValue{T}}, x::T) where T in DataValues at C:\Users\david\.julia\packages\DataValues\XQWvG\src\scalar\core.jl:31
Possible fix, define
convert(::Type{DataValue{T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R)}}, ::T<:(Union{CategoricalString{R}, CategoricalValue{T,R} where T} where R))
Stacktrace:
[1] macro expansion at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:102 [inlined]
[2] iterate(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}, ::Tuple{}) at C:\Users\david\.julia\packages\Tables\FXXeK\src\tofromdatavalues.jl:96 (repeats 2 times)
[3] first(::Tables.DataValueRowIterator{NamedTuple{(:fruit, :price),Tuple{DataValue{CategoricalString{UInt32}},Float64}},Tables.RowIterator{NamedTuple{(:fruit, :price),Tuple{CategoricalArray{Union{Missing, String},1,UInt32,String,CategoricalString{UInt32},Missing},Array{Float64,1}}}}}) at .\abstractarray.jl:342
[4] top-level scope at REPL[11]:1
Make sure you load DataValues
, otherwise Tables.jl will use a different code path.
Still no error with these versions:
(v1.2) pkg> st
Status `~/.julia/environments/v1.2/Project.toml`
[6e4b80f9] BenchmarkTools v0.4.2
[336ed68f] CSV v0.5.0 [`~/.julia/dev/CSV`]
[324d7699] CategoricalArrays v0.5.5 [`~/.julia/dev/CategoricalArrays`]
[a93c6f00] DataFrames v0.19.2
[e7dc6d0d] DataValues v0.4.12
[82899510] IteratorInterfaceExtensions v1.0.0
[2dfb63ee] PooledArrays v0.5.2+ [`~/.julia/dev/PooledArrays`]
[1a8c2f83] Query v0.12.1
[295af30f] Revise v1.0.2
[bd369af6] Tables v0.2.6
@nalimilan: here is a reproducer repo that you can run with this:
Thanks. So this is indeed a legitimate ambiguity which cannot be fixed separately in either package. I think @quinnj's https://github.com/JuliaLang/julia/pull/32613 should help fixing this, but that won't work for current Julia versions. Maybe we should define that unwrap
method in Compat or in DataAPI. But first we should bump that PR so that at least it's merged in master.
Do you think this is a new problem? I still don't understand why I didn't observe it in my original tests.